DOCUMENT RESUME

ED 322 216                                        TM 015 464

AUTHOR          Beaton, Albert E.; Zwick, Rebecca
TITLE           The Effect of Changes in the National Assessment:
                Disentangling the NAEP 1985-86 Reading Anomaly.
                Revised.
INSTITUTION     National Assessment of Educational Progress,
                Princeton, NJ.
SPONS AGENCY    National Center for Education Statistics (ED),
                Washington, DC.
REPORT NO       ETS-17-TR-21
PUB DATE        Feb 90
GRANT           OERI-G-008720335
NOTE            248p.; Other contributors to this report are Kentaro
                Yamamoto, Robert J. Mislevy, Eugene G. Johnson, and
                Keith F. Rust.
AVAILABLE FROM  National Assessment of Educational Progress,
                Educational Testing Service, Rosedale Road,
                Princeton, NJ 08541-0001.
PUB TYPE        Reports - Evaluative/Feasibility (142)
EDRS PRICE      MF01/PC10 Plus Postage.
DESCRIPTORS     Academic Achievement; Educational Change;
                *Educational Trends; Elementary School Students;
                Elementary Secondary Education; Item Analysis;
                *National Programs; *Reading Tests; Secondary School
                Students; *Standardized Tests; Statistical Analysis;
                *Testing Programs; Trend Analysis
IDENTIFIERS     *National Assessment of Educational Progress

ABSTRACT 

Results of new research into the anomalous results of
the 1986 reading portion of the National Assessment of Educational
Progress (NAEP) are reported. The original analysis of the 1986 data
indicated that the estimated performance level of 9- and 17-year-old 
students had dropped dramatically since 1984, whereas the performance 
of 13-year-olds had increased very slightly. Since it was unlikely 
that such large changes could have taken place in such a short time, 
the results were not presented to the general public and publication 
of the results was suspended until further research into their 
accuracy could be completed or until other corroborating evidence for 
the declines could be found. To this end, the design of the 1988 NAEP 
was modified, and new data were collected and analyzed. The main
finding from the 1988 data analyses is that the changes in assessment
booklets and procedures that were introduced in 1986 had a 
substantial and unpredictable effect on the estimates of performance. 
Although many of the same items were used in both the 1984 and 1986 
assessments, student performance on these items differed 
substantially when the items were administered in different contexts. 
Discovered differences between the 1986 and 1988 assessments were 
used in a design to estimate the reading performance of students in 
the 1986 sample. The new estimates show that reading performance 
declined slightly in 1986 at all age levels as compared to 1984. The 
declines, however, are not statistically significant. The new data 
also indicate a rebound in 1988 from the 1986 levels to about the 
same level of performance exhibited in 1984. The report contains 8 
chapters by Beaton, Zwick, Yamamoto and Mislevy detailing the 
research and analysing the data. Appendices include various types of 
statistical and methodological information. Twenty-five graphs and 33 
data tables are included. (TJH) 



THE NATION'S 




Albert E. Beaton • Rebecca Zwick 
In collaboration with — 

Kentaro Yamamoto • Robert J. Mislevy • Eugene G. Johnson • Keith F. Rust
with a Foreword by John W. Tukey

REVISED FEBRUARY 1990 



This report, No. 17-TR-21, can be ordered from the National Assessment of Educational Progress, Educational Testing
Service, Rosedale Road, Princeton, New Jersey 08541-0001.

The contents of this booklet were developed under a grant from the Department of Education. However, those contents do
not necessarily reflect the policy of the Department of Education, and you should not assume endorsement by the Federal
Government.

The work upon which this publication is based was performed pursuant to Grant No. G-008720335 of the Office of 
Educational Research and Improvement. 

Educational Testing Service is an equal opportunity/affirmative action employer. 

Educational Testing Service and ETS are registered trademarks of Educational Testing Service.



ERIC 






THE EFFECT OF CHANGES IN THE NATIONAL ASSESSMENT: 
DISENTANGLING THE NAEP 1985-86 READING ANOMALY 



Contents 

Acknowledgments v 

Foreword vii 

John W. Tukey 

Executive Summary xi

Albert E. Beaton 

Chapter 1 Introduction 1 

Albert E. Beaton 

Chapter 2 Summary of Earlier Research on the Reading Anomaly 15 
Albert E. Beaton 

Chapter 3 The Redesign of the 1988 Assessment 27 

Albert E. Beaton 

Chapter 4 Overview of Results 41 

Albert E. Beaton, Rebecca Zwick, 
Kentaro Yamamoto 

Chapter 5 Analyses of 1988 Reading Bridge Data 63 

Rebecca Zwick 

Chapter 6 Adjustment of the 1986 Reading Results to 

Allow for Changes in Item Order and Context . . 87 
Rebecca Zwick 

Chapter 7 Mathematics and Science Trend Data Analysis . . 111
Kentaro Yamamoto 






Chapter 8 Item-by-Form Variation in 1984 and 1986 NAEP

Reading Surveys 145 

Robert J. Mislevy 

Chapter 9 Epilogue 165 

Albert E. Beaton 

* * * 

Appendix A Summary of Modifications in Reading Scale 

Results Used in Tables and Figures 169 

Appendix B Sampling and Weighting Procedures 

for the 1988 NAEP Bridges 173 

Eugene G. Johnson, Keith F. Rust 

Appendix C Revision of Poststratification Weights for 

Age 9/Grade 4 and Age 13/Grade 8, 1984 NAEP . . 195 
Keith F. Rust 

Appendix D Conditioning Effects and IRT Parameters for 

Reading, Mathematics, and Science Items .... 199

Appendix E Estimation of the Standard Errors of 

the Adjusted 1986 NAEP Results 233 

Eugene G. Johnson, Robert J. Mislevy, 
Rebecca Zwick 

References 241 






ACKNOWLEDGMENTS 



This report was produced through the efforts of a team of talented and 
dedicated professionals. Our profound appreciation goes to all of them. 
Those who have made especially significant contributions are acknowledged 
below. 

We are particularly indebted to the members of NAEP's Design and
Analysis Committee--John Carroll, Robert Glaser, Bert Green, Sylvia Johnson,
Robert Linn, Ingram Olkin, Tej Pandey, Richard Snow, and John Tukey--for their
invaluable advice and suggestions.

We are grateful to the National Center for Education Statistics of the 
Office of Educational Research and Improvement, and in particular to Emerson 
Elliott, David Sweet, Gary Phillips, and Eugene Owen for their support, help, 
and suggestions. We also appreciate the worthwhile comments provided by the 
group of anonymous reviewers appointed by the National Center.

Special thanks go to the ETS internal review committee of William
Angoff, Gregory Anrig, Nancy Cole, Garlie Forehand, Eugene Johnson, Ann
Jungeblut, Archie Lapointe, Samuel Messick, Robert Mislevy, and Ina Mullis for
their important contributions and their encouragement. Special thanks also go
to reviewers from Westat, including Morris Hansen, Keith Rust, and Renee
Slobasky.

In addition, we are indebted to Juliet Shaffer and Fritz Drasgow, two 
specially appointed external reviewers, for their detailed and thoughtful 
comments. Kent Ashworth, John Barone, David Freund, Jo-Ling Liang, Terry
Salinger, and Kentaro Yamamoto deserve special thanks for their careful 
reviews and constructive observations. 

Essential to the report are statistical programming and the tabular and 
graphic presentation of NAEP data, which were carried out through the 
conscientious efforts of David Freund, Maxine Kingston, Edward Kulick, Jo-Ling 
Liang, Michael Narcowich, Kate Pashley, and Minhwei Wang. 

We also express our special gratitude to Debbie Kline, who supervised 
the preparation of the report and provided valuable editorial assistance. 
Kent Ashworth and Jan Askew coordinated the production of the publication. 



FOREWORD 



For decades I have been uncomfortable about behavioral science's failure 
to take the defects of measurement as seriously as seemed appropriate to one 
initially trained as a chemist. Appearances of such failures varied from 
highly quantitative matters like at least thinking about re-expressing 
observed numbers, to quite qualitative matters like thinking about 
possibilities of surrogacy. The areas where one kind or another of more 
thoughtful or more detailed approaches seem to have been relatively common 
include large-scale psychological and educational testing, market-basket 
selection for economic indices, and weighting aspects of surveys based on 
probability samples. Even so, if we had been asked five or more years ago 
about educational testing, we should have admitted less worry than for other 
fields, but not no worry. 

The most difficult--and to some of us the most important--task of NAEP,
to assess changes in what well-defined populations of young people can do,
requires much more care than traditional educational testing, whether in a
single class, or in a single school, or in a nationwide program. Asking about
individuals may well require less precision than asking about population means
(or medians), since standard errors for individuals are √n times as large
(and √n may easily be 50 or more), and thus often dominate measurement
uncertainty. Asking about performance, even population performance, at a
single time can also be sloppier, since the importance of small deviations 
along a scale is limited when one does not know what specific values on a 
scale mean. This need not be true of asking about short-term change, since it
may not be too hard to find relatively solid, widely separated anchor points. 
(We do not need precise knowledge of scale behavior between anchor points, 
since population measures of change will average individual values spread over 
a wide range--perhaps including the whole range between the anchor
points--and, consequently, will be relatively little affected by different
scale shapes--variances, of course, may be affected, but are always estimable,
at least in large part.) For these reasons, NAEP has had, since its earliest 
years, a much greater need for care in measurement than other educational 
testing programs. 
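The square-root relationship mentioned above can be illustrated with a short sketch. It assumes a simple random sample of independent observations, which is a simplification (NAEP's samples have a more complex design), and the spread of individual scores and the sample size used here are hypothetical values chosen only so that the square root of n equals 50.

```python
import math


def standard_error_of_mean(individual_sd: float, n: int) -> float:
    """Standard error of the mean of n independent observations."""
    return individual_sd / math.sqrt(n)


# Hypothetical values, not taken from the report.
individual_sd = 40.0   # spread of individual scores on some scale
n = 2500               # sample size; sqrt(2500) = 50

se_mean = standard_error_of_mean(individual_sd, n)
ratio = individual_sd / se_mean   # how many times larger the individual error is

print(f"standard error of the mean: {se_mean:.2f}")
print(f"individual error / error of the mean: {ratio:.0f}")
```

Under these assumptions the uncertainty attached to a single individual is 50 times that of the population mean, which is why small systematic measurement effects that would be invisible for individuals can matter a great deal for population trends.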

In the report that follows, Beaton spells out very clearly the tension 
between keeping things the same, to improve measurement of change, and 
changing them, to take advantage of new innovations. In NAEP's earliest 
years, in part because of inherent caution among the members of its technical 
advisory committees, this tension was resolved by strong pressure for keeping 
everything as nearly the same as seemed possible--though the results reported
in Chapter 8 could lead one to wonder whether enough was held constant. 

The "changing of the guard," when responsibility for NAEP was 
transferred to the Educational Testing Service, was marked by "new brooms" and 
an influx of more conventional psychometric attitudes. Since the "old guard," 
though concerned with possible effects of moving items, had not really
emphasized any concept of "item in context," it is not surprising that the
"new guard" succumbed to the superstition that behavior of a well-specified 
item was not materially influenced by context. Or that the "new guard" was 
rather more willing to make changes in other detailed aspects of test 
administration. 



We cannot fail to admire the firmness and wisdom shown by the present 
management of NAEP (assisted by its advisors) when the "1986 reading anomaly" 
was first detected. For not only was reporting held up (until results could 
be better understood) but changes were made in the rapidly approaching 1988 
assessment to make it possible to gather relevant data that allowed 
comparisons never before possible.

I can think of no comparable instance of a behavior-measurement program
where nearly so careful an analysis was undertaken. I had a hand in an
external examination of Alfred C. Kinsey's Sexual Behavior in the Human Male
(cp. Cochran, et al., 1955) and a worm's-eye view of an external examination
of the Coleman Report, but these efforts were not comparable in nature, scope,
or detail with the examination discussed in the present account.
Specifically, 

a) they did not have the advantage of planned supplemental 
samples- -of planned experiment, 

b) they were, by far, not as detailed or careful, and

c) they were conducted by outsiders, rather than by those responsible
for the data planning, gathering, and analysis. 

While we are still far from any ultimate quality of measurement, the 
analysis in the body of this report, and the lessons that are recognized as 
having been learned, are a major step forward. As a result, NAEP, in the 
years ahead, may be the first instance of behavioral measurement of which we 
may all be proud as a matter of careful measurement. 

This does not mean that the challenges to NAEP are ended. It is easy, 
and I hope ultimately helpful, to put down here a couple of examples of future 
challenges, so long as we do not consider them to be either exhaustive or 
representative:

a) Objective selectors and item constructors believe that reporting 
subscales is important. Is this only a matter of monitoring 
whether, at some future date, a subscale or two might show 
distinctive behavior? Or does NAEP measure the subscales 
separately enough so that we can routinely learn from their 
separate reporting? (Or are there other subscales that NAEP does 
measure separately enough which could be reported?) 

b) How should performance in different areas--such as mathematics,
reading, and science--be compared? Can anything useful be done
without bringing in relative expectations? (Could NAEP broaden
its focus to gather information on relative expectations, not only
of professionals, but of the general public?)

The lessons learned from the study of the reading anomaly are pointed 
out throughout the whole account that follows. They are focussed most
sharply, however, in the Epilogue (Chapter 9). I feel I can best serve the
hurried reader by quoting a few of the sharpest and clearest statements:

a) "When measuring change, do not change the measure." (page 165) 

b) "The tension between continuity and change is not unique to 
educational measurement." (page 166)

c) ". . . no measurement is perfect, especially the measurement of
changes over time." (page 167)

d) "The identification of technological limitations always presents a 
challenge for methodological improvement." (page 168)



To these must be added this general principle, of which (b) is a
consequence:

e) THE BEST WAY TO MEASURE CHANGE IS rarely TO MEASURE TWO LEVELS AS
BEST WE CAN SEPARATELY, AND THEN SUBTRACT ONE NUMBER FROM THE
OTHER.

The Educational Testing Service, its NAEP scientists, and the authors of 
this report deserve hearty congratulations from all of us for bravery, 
insight, stick-to-itiveness, and care in inquiry, and for clarity and honesty
of exposition. 



JOHN W. TUKEY
Princeton, New Jersey 
December 13, 1989 






THE EFFECT OF CHANGES IN THE NATIONAL ASSESSMENT:
DISENTANGLING THE NAEP 1985-86 READING ANOMALY 



EXECUTIVE SUMMARY 



Albert E. Beaton 



Since 1969, the National Assessment of Educational Progress (NAEP) has 
reported what students in American schools, both public and private, know and 
can do. Over the years, NAEP has assessed student proficiency in reading, 
writing, mathematics, and science as well as a number of other subject areas. 
Reading proficiency has been assessed six times--in 1971, 1975, 1980, 1984,
1986, and 1988--and will be assessed again in 1990. NAEP has focused on
measuring educational trends, and its long-term trend reports have been based
on carefully selected national
probability samples of 9-, 13-, and 17-year-old students. NAEP has become a
respected indicator of progress in American education. 

Maintaining the integrity and credibility of NAEP requires the 
development and careful execution of a complex assessment design and, 
ultimately, sound professional judgment. The original analysis of the 1986 
reading trend data showed anomalous results. The estimated performance level 
of 9- and 17-year-old students had dropped dramatically from 1984, whereas the
performance of 13-year-olds had increased very slightly. Since it was deemed 
unlikely that such large changes could have taken place in such a short time, 
it was decided not to present the results to the general public and to suspend 
publication of the results until further research into the accuracy of the 
results could be completed or other corroborating evidence for the declines 
could be found. To collect such additional evidence, the design of the 1988 
National Assessment was modified, and the new data have now been collected and 
analyzed. The purpose of this report is to present the results of this new
research into the anomalous results of the 1986 reading assessment. 

The main research finding from the study of the 1988 data is that the 
changes in assessment booklets and procedures that were introduced in 1986 had 
a substantial and unpredictable effect on the estimates of performance. 
Although many of the same items were used in both the 1984 and 1986 
assessments, student performance on these items differed substantially when
the items were administered in different contexts. The additional data 
gathered in the 1988 assessment allowed the study of the differences between 
assessment systems, and the differences that were found were used to 
reestimate the reading performance of students in 1986. 

The new estimates show that reading performance declined slightly in 
1986 at all age levels from the 1984 levels. The declines, however, are not 
statistically significant; that is, estimated declines of this magnitude might
have occurred through random variation even if there had been no actual
changes in student performance between 1984 and 1986. The new data also show 
a rebound in 1988 from the 1986 levels to about the same level of performance 
that was exhibited in 1984. 

The research into the anomalous data can be further verified in 1990. 
The 1990 assessment will produce additional data that will be available to 
check the accuracy of the modified estimates and to investigate ways to 
improve the estimates of error in all assessments.







Chapter 1 
INTRODUCTION 



Albert E. Beaton¹



The National Assessment of Educational Progress (NAEP) is an ongoing, 
congressionally mandated survey that has been designed to measure what 
students in American schools know and can do. Since 1969, NAEP has been 
measuring trends in student performance in many academic subject areas, 
including reading, writing, mathematics, and science. NAEP
reports are based on carefully selected national probability samples of 9-,
13-, and 17-year-old students in American schools, both public and private. 
NAEP is the only regularly conducted survey of educational achievement at the 
elementary, middle, and high school levels. 

As in any long-term project measuring change, there is a tension between
measuring trends in education, which implies maintaining continuity with
NAEP's past objectives and measurement procedures, and introducing the best
new curriculum concepts and measurement technology, which implies making
changes from past assessments. In the 1984 and 1986 assessments², a number of
innovations were introduced into NAEP in order to improve the measurement and
reporting of educational proficiency. It was expected that the new technology
would not produce results directly comparable to previous assessments, so
safeguards were also introduced to protect the relevance of the data collected in
previous assessments. The safeguards included special "bridge" samples that
were assessed in the same way as in past assessments and were intended to form
bridges between the new and old assessment technologies. However, even with
these safeguards, the difficulty involved in maintaining accurate trend
measures while introducing innovations became apparent when the 1986 NAEP
reading trend data were analyzed. This report is the story of the changes
that were made and the effects they had on the estimation of student
performance; it also describes some of the lessons learned about measuring
trends, which should provide a valuable contribution to assessment and
psychometric research.

¹Jo-Ling Liang produced the figures in this chapter.

²Although assessments are conducted in school years, this report will refer
to assessments by the second year only. For example, the 1985-86 assessment will
be referred to as the 1986 assessment.

A major innovation in the 1984 and 1986 assessments was the introduction 
of the NAEP scales. NAEP scale scores can range from 0 to 500, but typically
fall between 100 and 400. The scale scores may be interpreted as estimated
scores on a hypothetical 500-item test with certain idealized properties.
Built using item response theory (IRT), the NAEP scales are developmental in the
sense that 9-, 13-, and 17-year-old students are reported on the same subject
area scales, and their proficiencies can be compared. In 1986, subscaling was 
introduced in mathematics and science so that proficiency in different parts 
of a subject area (e.g., algebra and physical science) could also be reported. 
Scale anchoring was introduced to report what students at a particular score 
level knew and were able to do that students scoring at lower levels could 
not. A full description of the NAEP scales and the technology used in NAEP 
can be found in Beaton (1987, 1988b). A general discussion of the issues in
NAEP scaling is in Mislevy (1988).
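To make the idea of an IRT item model concrete, the following is a minimal sketch of one common form, the three-parameter logistic item response function. The passage above does not say which IRT model underlies the NAEP scales, so the choice of model, the scaling constant D = 1.7, and all parameter values are assumptions made for illustration only. The sketch also shows what a context effect would look like in these terms: the same item behaving as if its difficulty had shifted.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float, D: float = 1.7) -> float:
    """Three-parameter logistic item response function.

    theta: examinee proficiency; a: discrimination; b: difficulty;
    c: lower asymptote (chance of a correct answer at very low proficiency).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item parameters on a standardized proficiency metric.
a, b, c = 1.0, 0.0, 0.2

# Probability of a correct answer for an examinee at theta = 0.
original_context = p_correct(theta=0.0, a=a, b=b, c=c)

# The same item, but behaving as if it were harder (difficulty shifted by a
# hypothetical 0.3) after being moved to a different booklet context.
new_context = p_correct(theta=0.0, a=a, b=b + 0.3, c=c)

print(f"original context: {original_context:.3f}")
print(f"new context:      {new_context:.3f}")
```

If item parameters estimated in one context are applied to responses gathered in another, a shift of this kind would be attributed to the students rather than to the item, which is the sort of distortion the report investigates.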






The bridge samples were essential for estimating performance in past
assessments on the new NAEP scales. In 1984, reading and writing were
assessed. In the 1984 reading assessment, samples of students were assessed
using the newer technology and randomly equivalent samples of students were
assessed using the same technology that NAEP had used in previous assessments.
Using the fact that the randomly equivalent samples were in principle 
equivalent in reading proficiency, except for sampling error, the results from 
previous assessments were projected onto the new NAEP reading scale. Data
from past writing assessments were not projected onto the new (non-IRT)
writing scale. In 1986, mathematics and science were both assessed using
equivalent samples, one using the new technology and the other using
traditional NAEP practices, and then the data from previous assessments were 
projected onto the new mathematics and science scales. Reading was also 
assessed in 1986, but this time using only the newer technology, since the 
change in technology had already been bridged in 1984. Although the general 
technology of the 1984 and 1986 reading assessments remained the same, some 
seemingly minor modifications of the booklets and administrative procedures 
did occur, and there was no bridge sample with which to measure their effect. 
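The logic of a bridge can be sketched in a few lines. Because two randomly equivalent samples should have the same proficiency distribution apart from sampling error, results in the old metric can be mapped onto the new scale by matching the distributions of the two samples. The sketch below uses the simplest such mapping, a linear transformation that matches mean and standard deviation. This is a stand-in chosen for illustration, not the projection procedure actually used in NAEP (which is documented in the references cited above), and all scores are hypothetical.

```python
import statistics


def linear_link(old_scores, new_scores):
    """Return a function mapping the old metric onto the new scale.

    The mapping is chosen so that the old-metric bridge sample, once
    transformed, has the same mean and standard deviation as the
    randomly equivalent sample measured on the new scale.
    """
    mean_old = statistics.mean(old_scores)
    sd_old = statistics.pstdev(old_scores)
    mean_new = statistics.mean(new_scores)
    sd_new = statistics.pstdev(new_scores)
    slope = sd_new / sd_old
    intercept = mean_new - slope * mean_old
    return lambda x: slope * x + intercept


# Hypothetical bridge data: one sample scored in the old metric, a randomly
# equivalent sample scored on the new scale.
bridge_old = [52.0, 61.0, 58.0, 70.0, 49.0]
bridge_new = [240.0, 262.0, 255.0, 281.0, 232.0]

to_new_scale = linear_link(bridge_old, bridge_new)

# Any earlier result expressed in the old metric can now be projected.
projected = [to_new_scale(x) for x in bridge_old]
print([round(x, 1) for x in projected])
```

The point of the sketch is the dependency it exposes: the projection is only as good as the claim that the two samples differ in nothing but the assessment technology. When procedures change and no bridge sample exists, as with reading in 1986, there is nothing from which to estimate such a mapping.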

When they were first produced, the NAEP estimates of the reading
proficiency of students in American schools in 1986 appeared anomalous in the
judgment of the NAEP project staff and its technical advisors. Thus, the
publication of trend results from the 1986 assessment was suspended; instead,
the trend results were subjected to further investigation and documented in a
technical report (Beaton, 1988a). The anomalous estimates are shown in Figure
1.1. The average reading proficiency estimates for 1986 indicate very sharp
declines at ages 9 and 17 from the estimates for the 1984 students but no
corresponding decline--in fact, a slight rise--in average reading proficiency
at age 13. The data suggested that the reading proficiency of the 1986
students was substantially more variable at all three age levels than in past
assessments, with the result that more students were estimated to be at both
very high and very low levels of reading proficiency. Such substantial
changes in reading proficiency were considered extremely unlikely to have
occurred over a two-year period without being noticed and reported by the
teaching profession. Therefore, it was recommended that these results should
not be used for estimating trends in American education until supported by
corroborating evidence.

Figure 1.1

Estimated Average Reading Performance, 1971-1988
with Anomalous Results for 1986
(and standard errors*)

[Line graph not reproduced: average reading proficiency on the 0-500 scale
(axis shown from 200 to 300) by assessment year (1971, 1975, 1980, 1984, 1986,
1988), with separate lines for Age 17, Age 13, and Age 9.]

*Bands extend from two estimated standard errors below to two estimated standard
errors above the mean. Appendix A (p. 171) gives a summary of which modifications of
reading scale results are used in the tables and figures in this report.

Estimated Average Reading Performance, 1971-1988
with Anomalous Results for 1986

Year    Age 9 (S.E.)    Age 13 (S.E.)    Age 17 (S.E.)

1971    207.3 (1.0)     255.2 (0.9)      285.4 (1.2)
1975    210.2 (0.7)     256.0 (0.8)      286.1 (0.8)
1980    214.8 (1.1)     258.5 (0.9)      285.8 (1.4)
1984    212.9 (1.0)     258.0 (0.7)      288.8 (0.9)
1986    207.3 (1.?)     260.4 (1.1)      277.4 (1.0)
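The size of the anomaly can be gauged with a rough calculation on the values tabulated with Figure 1.1. The sketch below compares the 1984 and 1986 means at ages 17 and 13, treating the two assessments as independent samples so that the standard error of the difference is the square root of the sum of the squared standard errors. That independence assumption and the resulting z ratio are a simplification introduced here for illustration; they are not the report's own significance procedure. The helper for the two-standard-error band follows the note to Figure 1.1.

```python
import math


def change_with_z(mean_a: float, se_a: float, mean_b: float, se_b: float):
    """Difference (b - a), its standard error, and the ratio of the two,
    assuming the two estimates come from independent samples."""
    diff = mean_b - mean_a
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se_diff, diff / se_diff


def two_se_band(mean: float, se: float):
    """Band from two standard errors below to two above the mean."""
    return mean - 2 * se, mean + 2 * se


# Values from the table of anomalous results (1984 versus 1986).
age17 = change_with_z(288.8, 0.9, 277.4, 1.0)
age13 = change_with_z(258.0, 0.7, 260.4, 1.1)

for label, (diff, se_diff, z) in (("Age 17", age17), ("Age 13", age13)):
    print(f"{label}: change = {diff:+.1f}, s.e. = {se_diff:.2f}, ratio = {z:+.2f}")

print("Age 17 band, 1984:", two_se_band(288.8, 0.9))
print("Age 17 band, 1986:", two_se_band(277.4, 1.0))
```

Under these assumptions the age 17 drop of 11.4 points is more than eight standard errors in size, and the two-standard-error bands for 1984 and 1986 do not come close to overlapping, whereas the age 13 rise is under two standard errors. A change that large over two years is what prompted the search for a cause other than a real change in student performance.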

The purpose of this report is to present a detailed technical 
explanation for modified estimates of the trends in reading performance for 
the years 1971 to 1988, including the 1986 results. Substantial new evidence 
has been collected and, after a reanalysis of the reading trend data that 
included additional data from the 1988 assessment, the estimated long-term 
trends in student reading proficiency have been modified. The modified 
reading trend estimates, extended to 1988, are shown in Figure 1.2. 

The modified trend estimates suggest that the average reading
proficiency of students declined slightly at all three age levels from 1984 to
1986 and that the 1988 students rebounded to about the same averages as their
1984 counterparts. These new trend estimates show similar declines at all
three age levels in 1986, not the steep declines that appeared in the first
runs of the data at ages 9 and 17. The variances in student performance are
now reasonably similar over the several assessment years. The remaining
apparent decline in 1986, although slight and not statistically significant,
and the apparent rebound in 1988 are not fully understood. Consequently, the
questionable 1986 reading proficiency estimates are not presented to the
general public in The Reading Report Card, 1971 to 1988 (Mullis & Jenkins,
1990).

Figure 1.2

[Line graph not reproduced.]

*Bands extend from two estimated standard errors below to two estimated standard
errors above the mean. Appendix A (p. 171) gives a summary of which modifications of
reading scale results are used in the tables and figures in this report.

Modified Results
Weighted Reading Proficiency Means and Standard Errors

Year             Age 9 (S.E.)    Age 13 (S.E.)    Age 17 (S.E.)

1971             207.3 (1.0)     255.2 (0.9)      285.4 (1.2)
1975             [illegible]     258.0 (0.8)      286.1 (0.8)
1980             [illegible]     [illegible]      285.8 (1.4)
1984             211.0 (1.0)     257.1 (0.7)      [illegible]
1986 Adjusted    208.6 (1.9)     [illegible]      [illegible]
1988             211.8 (1.2)     257.5 (0.9)      290.1 (1.1)

The investigation into the anomalous 1986 reading results is not yet
complete. The analysis of the data collected during the 1988 assessment has
resulted in improved procedures for estimating trends in educational
performance. In the 1990 assessment, more data will be collected so that the
effectiveness of the newer methods can be tested in actual practice. The 
results of the further investigation will be published when they are complete. 

* * * 

Recommending that the publication of timely results be postponed is a
serious matter. The decision to withhold results hinges upon contrasting the
possible harm of publishing erroneous results against the possible
consequences of failing to publish correct results. For example, the sharp
decline in performance at age 17 might not have been an accurate
representation of true changes in student performance but rather the result of
flawed assessment procedures, such as errors in assessment booklets, sampling
procedures, field administration, data processing, or scaling. On the advice
of the NAEP Design and Analysis Committee³, the design, administration, and
analysis of the 1986 assessment were carefully reviewed by ETS/NAEP staff and
a full report, The NAEP 1985-86 Reading Anomaly: A Technical Report (Beaton,
1988a), was prepared and published. The NAEP Assessment Policy Committee and
the staff of the National Center for Education Statistics (NCES) of the U.S.
Office for Educational Research and Improvement (OERI) concurred with the
Design and Analysis Committee's advice to suspend publication of the 1986
reading trend results until more information was available to support the
apparent decline. The OERI also commissioned an independent Technical Review
Panel to prepare a thorough review of NAEP technology, and this panel has
published its report (Haertel, 1989).

³The Design and Analysis Committee members are Robert Linn (Chair), John
B. Carroll, Robert Glaser, Bert Green, Jr., Sylvia Johnson, Ingram Olkin, Tej
Pandey, Richard Snow, and John W. Tukey. Barbara Shapiro served as an observer
for the NAEP Assessment Policy Committee.

Fortunately, the early discovery of the 1986 reading anomaly and the 
consequent postponement of the publication of the 1986 reading trend results
allowed enough time not only to review NAEP procedures thoroughly but also to 
modify the following assessment in 1988 in order to collect additional, 
corroborating data. In this detailed review of NAEP procedures, a number of 
hypotheses about the reason for the decline were investigated; these
hypotheses are summarized in Chapter 2 of this report. Most of the hypotheses
were rejected as very unlikely to have caused such precipitous declines in 
estimated student performance, but some could not be accepted or rejected 
because additional information was needed. In particular, in 1986 there were 
some seemingly minor changes in the administrative procedures, assessment 
booklets, and timing from the assessment in 1984. It also had to be 
considered that a combination of causes, not one cause alone, could explain 
the anomalous results (Hedges, 1989). Since there was still a possibility 
that the 1986 assessment results did not represent true change in student 
performance, the prudent course was to suspend publication until corroborative
information was available. 






At the time the anomalous results were discovered, the 1988 assessment 
had already been designed but not yet implemented, and it was early enough in 
the assessment cycle to modify the 1988 assessment design in order to gather 
some essential explanatory information. The modifications of the design are 
detailed in Chapter 3. Briefly, in 1988 both the 1984 and 1986 assessment 
procedures, booklets, and timings were administered to separate, randomly 
equivalent samples of 1988 students. To reproduce the 1986 assessment 
procedures precisely required administering some mathematics and science 
questions in 1988, even though these subject areas were not included in the 
original 1988 NAEP design. The pertinent data have now been collected and 
analyzed, and the data from 1986 as well as data from previous assessments 
have been subjected to further analyses and investigation. The mathematics 
and science data have been analyzed for comparison to those collected in 1986. 
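The reasoning behind the 1988 design can be sketched as follows. If two randomly equivalent samples of 1988 students are assessed, one under the 1984 procedures and one under the 1986 procedures, then any difference between their results estimates the effect of the change in procedures rather than a difference in the students. That estimated effect can then be removed from the originally reported 1986 result. The sketch below applies this idea to mean scores only. It is a deliberately simplified stand-in, not the adjustment actually carried out (Chapter 6 of the report describes that), and every number in it is hypothetical.

```python
def procedure_effect(mean_under_1986_procedures: float,
                     mean_under_1984_procedures: float) -> float:
    """Estimated effect of the 1986 procedures relative to the 1984 ones,
    from two randomly equivalent samples of the same (1988) population."""
    return mean_under_1986_procedures - mean_under_1984_procedures


def adjust_1986_mean(reported_1986_mean: float, effect: float) -> float:
    """Remove the estimated procedure effect from the reported 1986 mean."""
    return reported_1986_mean - effect


# Hypothetical 1988 bridge results for one age group.
effect = procedure_effect(mean_under_1986_procedures=280.0,
                          mean_under_1984_procedures=288.0)

# Hypothetical originally reported 1986 mean for the same age group.
adjusted = adjust_1986_mean(reported_1986_mean=277.0, effect=effect)

print(f"estimated procedure effect: {effect:+.1f}")
print(f"adjusted 1986 mean:         {adjusted:.1f}")
```

A mean-only correction of this kind would leave the spread of the score distribution untouched, and the effect is itself an estimate with sampling error, so an adjusted result necessarily carries more uncertainty than an unadjusted one.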

A summary of the recent investigations and their effect in modifying the 
estimated trend results are presented in Chapter 4 of this report. Subsequent 
chapters contain the details of the studies that contributed to the trend 
modifications.

* * *

The new data and analyses explain a good part, but not all, of the 
anomalous estimates of reading performance. Thus, the correctness of the 
decision to postpone presenting the 1986 trend results to the general public 
has been generally supported. Many individuals and agencies participated in 
the decision to suspend broad dissemination of the results until the questions 
about their validity could be satisfied, despite the obvious difficulties in 
missing an important publication deadline. Such careful professional judgment 
is essential to the continued integrity and credibility of NAEP. 




Although the delay in presenting results was unfortunate, the 1986 
reading anomaly has important lessons for future assessments and for other 
educational measurement programs as well; indeed, there are valuable lessons 
for the public in its perception of the results of any survey or poll. We 
would be remiss in not reporting what we have learned as well as reporting the 
modified results. More will be said about this elsewhere in this report. 

One overall lesson stands out: When measuring change, do not change the 
measure. Precise implementation of this dictum is, of course, impossible in 
actual practice. In fact, NAEP has modified its measurement instruments by 
rearranging and reformatting assessment exercises since it began measuring 
trends. 

When ETS became the NAEP grantee in 1983, it introduced item response 
theory (IRT) into NAEP in order to fulfill in an efficient manner NAEP's 
primary goal of reporting to the public what students in American schools know 
and can do. It is important to stress that the introduction of IRT technology 
did not cause the anomalous results; however, IRT could not compensate for the 
format and context changes either. Under the assumptions of IRT, test items 
have characteristics that are invariant in different contexts, and this 
property has been widely publicized and valued in the psychometric literature 
(see Chapter 6). Assuming this property and following past NAEP practice, the 
1986 assessment booklets included many 1984 items, but placed them in 
different contexts. The results of the analyses of the data from the 
redesigned 1988 NAEP have demonstrated that, contrary to accepted assumptions, 
the item context substantially affected the behavior of these items. This 
effect is shown to be the major contributor to the 1986 reading anomaly. 

Although they are slightly less easy to see and more difficult to 
isolate, the same measurement changes affect the proportion of students who 
respond correctly to individual items and thus also affect the average 
percentage of correct responses to a group of items. The adoption of IRT 
procedures in NAEP did not cause the anomalous results, but rather dramatized 
the effect of the measuring instrument on the perception of the phenomenon 
measured. It should be noted that the inferences of IRT are valid given the 
truth of the assumptions, but the assumptions may not be true; they are 
assumptions about the state of nature, not natural laws. In most IRT 
applications that compare individual students' scores or changes over time, 
violations of the assumptions may result in inaccuracies small enough to be 
ignored, since the inaccuracies are typically less than the error of 
measurement. However, changes in format and context that may be considered 
negligible when comparing individuals may not be negligible when comparing 
differences among subpopulations over time (see Chapter 8 of this report). In 
the particular case of NAEP, the effects of changes in measurement were 
apparently larger than the trend effects that were being measured. Thus, 
maintaining identical instruments is critical when looking for small 
differences. 
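
The contrast between individual and subpopulation comparisons can be 
illustrated with simple arithmetic. The sketch below is not part of the NAEP 
analyses. It assumes a constant shift in scores (`bias`) caused by a change in 
the instrument, a hypothetical standard deviation (`sd`) for the uncertainty in 
a single student's score, and simple random sampling of students. The function 
name and all numbers are illustrative assumptions. 

```python
import math


def bias_to_se_ratio(bias, sd, n):
    """Ratio of a constant bias to the standard error of a mean of n scores.

    Assumes simple random sampling, so the standard error is sd / sqrt(n).
    """
    standard_error = sd / math.sqrt(n)
    return bias / standard_error


# Hypothetical values, not taken from the report.
bias = 2.0   # constant shift caused by a format or context change
sd = 30.0    # standard deviation of the uncertainty in one student's score

for n in (1, 100, 2500):
    # The ratio grows with the square root of the group size.
    print(n, round(bias_to_se_ratio(bias, sd, n), 2))
```

Under these assumed values the shift is a small fraction of the uncertainty in 
one student's score, but it is more than three times the standard error of a 
mean based on 2,500 students. A complex sample design such as NAEP's would 
give larger standard errors than this simple formula, so the sketch shows only 
the direction of the effect. 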

This important lesson has led to an improvement in the 1990 NAEP design 
that has also been proposed for future assessments. In the 1990 design for 
long-term age trends, the trends in proficiency will be estimated using 
identical assessment booklets, administrative procedures, and timings as in 
the last assessment in the same subject area. In other words, each subject 
area for which trends are reported will duplicate as closely as possible the 
previous assessment with which it is to be compared, including booklets 
printed from the same plates, identical instructions for administration, and 
precise replications of definitions for target populations. In the future, if 
new assessment instruments are developed, they will be used to estimate long- 
term age trends only after they have been administered in two different 
assessments and their relationship to the previous trend instruments has been 
firmly established. This design improvement has the important effect of 
separating the part of NAEP used for trend estimation from the part of NAEP 
used to prepare detailed estimates of the proficiencies of the current 
students in American schools.^ 

As mentioned earlier, NAEP has experienced a continuing tension between 
the need to retain comparability with the past and the need to be innovative 
in assessing the curriculum that is currently valued and taught. The new 
design separates the two NAEP functions, with separate samples dedicated to 
each. The trend samples are required to replicate past assessments as closely 
as possible; the cross-sectional samples are free to be innovative, 
introducing new objectives, new items, and new technologies. If curriculum 



^In 1990, the long-term age trend samples will duplicate past measurement 
procedures as closely as possible and will use identical assessment booklets. 
The 1990 main NAEP samples will be used for short-term grade trends in reading. 
These samples will be BIB-spiraled; that is, the items will be divided into seven 
blocks, and three of these blocks will be placed in each assessment booklet using 
a balanced incomplete block (BIB) design. In this way, each item block will 
appear with each other block in one booklet, and each student will be asked to 
respond to only three of the seven blocks of items. Three to four of the seven 
blocks at each age/grade level will be identical to those used in the 1988 
assessment, although they may be administered in conjunction with new reading 
blocks. It should also be noted that because the designs for future assessments 
will have to be consistent with the legal requirement that half of the items be 
publicly released, NAEP may have to differ somewhat from the ideal of maintaining 
identical measuring instruments, but adequate samples to estimate the effects 
of such differences will be maintained. 
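
The balanced incomplete block arrangement described in this note (seven 
blocks, three per booklet, each pair of blocks together in one booklet) can be 
sketched as follows. The report does not say how blocks were actually assigned 
to booklets; the cyclic construction from the base set (0, 1, 3) is a standard 
textbook device used here as an assumption, and the function names are 
hypothetical. The sketch does not address the order of blocks within a booklet. 

```python
from itertools import combinations


def cyclic_bib(n_blocks=7, base=(0, 1, 3)):
    """Build booklets by shifting a base set of block indices modulo n_blocks."""
    return [
        tuple(sorted((b + shift) % n_blocks for b in base))
        for shift in range(n_blocks)
    ]


def pair_counts(booklets):
    """Count how many booklets contain each unordered pair of blocks."""
    counts = {}
    for booklet in booklets:
        for pair in combinations(booklet, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


booklets = cyclic_bib()
counts = pair_counts(booklets)

for booklet in booklets:
    print(booklet)

# Balance check: all 21 pairs of the 7 blocks occur, each in exactly one booklet.
print(len(counts), set(counts.values()))
```

With this construction there are seven booklets, each block appears in three 
of them, and every pair of blocks shares exactly one booklet, which is what 
allows the relationship between any two blocks to be estimated even though no 
student responds to all seven. 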




changes become so substantial as to render the trend samples obsolete, and if 
the new, cross-sectional samples stabilize, it may be possible to replace the 
older trend data with newer, more relevant trend information. In any case, 
any innovation introduced into the cross-sectional assessment will be linked 
to the trend scale, when this proves possible, and then moved into the trend 
portion of the assessment when this is desirable and the links to the trend 
data have been fully established. 






Chapter 2 

SUMMARY OF EARLIER RESEARCH ON THE READING ANOMALY 
Albert E. Beaton 



The discovery of the anomalous trend estimates during the analysis of 
the 1986 data started a flurry of activity to identify their causes. At 
first, it was assumed that some data processing error, such as a bug in a 
computer program, would explain the unusual results, so a concerted effort was 
made to examine the computer systems from data entry to final trend 
estimation. When no such error was found, we investigated more complex 
reasons for the anomaly. Finally, when no conclusive reason was found for the 
sharp changes in estimated performance, the 1988 NAEP design was modified to 
collect new data that we hoped would explain the anomaly. 

This chapter presents a summary of the investigations into the reading 
anomaly that took place before the new explanatory data were collected. These 
studies are not reported in detail here since they have already been fully 
reported in The NAEP 1985-86 Reading Anomaly: A Technical Report (Beaton, 
1988a); the reader is directed to that report for technical details. This 
previous report on the reading anomaly has been thoroughly reviewed and 
discussed by the Technical Review Panel on the 1986 Reading Anomaly (Haertel, 
1989). The Technical Review Panel performed additional investigations into 
the anomalous results, and presented the results of these studies in its 
report. 




Eight general classifications of hypotheses were investigated in the 
original study, relating to population and sample, measuring instruments, 
administrative changes, quality control, scaling, items, booklets and blocks,^ 
and others. 

Population/Sample Hypotheses 

The first set of hypotheses revolved around the possibility that either 
the composition of the populations of students that NAEP assesses or the 
actual NAEP samples from these populations of students had changed in some 
substantial way that would result in sharp declines in performance. The NAEP 
population is very precisely defined and the sample carefully drawn. However, 
a sharp change in the population, such as an increase in the number of 
traditionally low-scoring students, would be likely to result in a decline in 
average proficiency. Also, it was important to assure that the samples that 
were actually drawn were not unusual and were representative of the intended 
student populations within the range of sampling variability. 

Detailed study showed no reason to believe that the NAEP sample was not 
representative of the nation's students. First, we do not know how to produce 
a better estimate of the numbers of in-school 9-, 13-, and 17-year-old 
students, since the sampling weights produced using the present NAEP 

^A block is a timed portion of an assessment that contains assessment items 
and/or background and attitude questions. In the 1984 and 1986 assessments, each 
assessment booklet contained a common block, which included background and 
attitude questions, and three variable blocks, which included mostly subject 
area exercises but also some background or attitude questions. Variable blocks 
are assigned to booklets using a balanced incomplete block design, and the 
booklets are placed in a random sequence (spiraled) so that students in an 
assessment session receive different booklets. A student is allowed about an hour 
to respond to a booklet. The timings and contents of the blocks are discussed 
in Chapter 3. 




technology are poststratified^ using information from the Census Bureau, the 
Current Population Survey, and the NAEP samples themselves. In any case, the 
small differences in the NAEP estimates of population sizes since 1984 could 
not have had a major effect on the average student performance. 

E. G. Johnson (1988a) described the NAEP sampling process. There was no 
substantial difference in the percentage of students excluded from NAEP 
because of limited English proficiency, behavioral disorders, or physical or 
mental handicap. He did not find any reason to believe that a substantial 
change in the dropout rate for 17-year-olds occurred. 

Johnson (1988b) also investigated the attributes of low scorers to see if 
there was an unusual increase in low scores in any discernible subgroup of 
students. He found, however, that the proportion of low scorers increased in 
all major subgroups of students, not merely in one or two. Johnson also 
examined the data to determine whether the decline was concentrated in a few 
schools and concluded that it was not. 

Although the changes in population sizes were slight, Beaton (1988a, 
Chapter 6) investigated whether the slight changes could have a major effect 
on the estimates of reading performance, and concluded that they could not 
have substantially affected the overall trend estimates. In fact, the 
evidence showed that the decline in proficiency at age 17 was pervasive, 
occurring in all of the groups for which NAEP has traditionally reported 
results. Estimated proficiencies declined for 

• boys and girls, with the decline for boys somewhat 
larger than that for girls; 

• all racial and ethnic groups; 



^See Appendix B. 



• all regions of the country, with the decline being 
least in the Northeast; and 

• students whose parents did not graduate from high 
school, students whose parents did, and students whose 
parents had some education beyond high school. 

We therefore concluded that neither population shifts nor the 
composition of the NAEP sample contributed substantially to the reading 
anomaly. 

Measuring Instrument Hypotheses 

It was thought that if the populations and samples did not explain the 
reading anomaly, then, perhaps, the measurement instruments would. 
Accordingly, we investigated several hypotheses about the assessment forms. 
We found that there were a number of seemingly minor differences in the 
assessment forms used in NAEP between 1984 and 1986. These changes are 
documented in detail by J. R. Johnson (1988); some of the changes are also 
presented in the next chapter of this report. 

We had no reason to doubt the validity of the NAEP 1986 reading 
assessment as a measure of reading proficiency. The separately timed and 
scored blocks of assessment items may be considered as short tests, and as 
such they were subjected to standard item analyses. The item analysis 
statistics were comparable to those that occur in tests of similar length. 
The student number-right scores on various assessment blocks were correlated. 
For age 17, the median correlation, as well as the range of correlations, 
among the reading blocks and between the reading, mathematics, science, and 
computer competence blocks is shown in Table 2.1. The reading blocks 
contained fewer items than the blocks in other curriculum areas; however, the 




Table 2.1 

Correlations Among NAEP Blocks 
1986 Assessment, Age 17 

                                                                  Computer
                        Reading       Math          Science       Competence

Reading       N         15            12            14            28
              Median    .65           .60           .54           .38
              Range     (.48 - .75)   (.46 - .65)   (.39 - .66)   (.19 - .57)

Math          N                       55            28            14
              Median                  .74           .62           .52
              Range                   (.58 - .92)   (.48 - .80)   (.24 - .60)

Science       N                                     55            12
              Median                                .62           .51
              Range                                 (.46 - .72)   (.22 - .63)

Computer      N                                                   15
Competence    Median                                              .57
              Range                                               (.40 - .66)




average reliability of individual reading items, which was estimated using the 
Spearman-Brown prophecy formula, is estimated to be greater than in other 
curriculum areas. 
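
The comparison of per-item reliability across blocks of different lengths can 
be sketched with the Spearman-Brown prophecy formula. The exact computation 
used for the report is not given in this section, so the code below is an 
assumption: it inverts the formula to obtain the reliability of a single item 
implied by the reliability of a block, and the block lengths and reliabilities 
in the example are hypothetical. 

```python
def spearman_brown(r, k):
    """Reliability of a test k times as long as one with reliability r."""
    return k * r / (1 + (k - 1) * r)


def single_item_reliability(r_block, n_items):
    """Invert Spearman-Brown: item reliability implied by an n_items block."""
    return r_block / (n_items - (n_items - 1) * r_block)


# Hypothetical blocks, not values from the report: a short block can have a
# lower block reliability and still a higher reliability per item.
short_block = single_item_reliability(r_block=0.65, n_items=10)
long_block = single_item_reliability(r_block=0.74, n_items=20)

print(round(short_block, 3), round(long_block, 3))
```

Applying `spearman_brown` to the single-item value with the original number 
of items returns the block reliability, which is a convenient check that the 
inversion is consistent. 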

Although there was no reason to doubt that the 1986 assessment measured 
reading, the changes in the assessment instruments did lead to the suspicion 
that the 1986 assessment measured reading differently, in a way not fully 
comparable with past assessments. Before the physical changes in the 
measuring instruments were made, they were judged by professional staff to be 
so minor as not to affect the students' responses in a substantial way. For 
example, there was a change in the number of items in an assessment booklet, 
but there was also a corresponding change in the amount of time allocated to 
respond. Since we could not be sure that these minor changes did not have a 
major effect, we therefore could not reject the hypothesis that the changes in 
the assessment instruments produced the sharp declines in estimated 
performance. 

Administrative Changes Hypotheses 

We also investigated hypotheses about the administration of the 
assessment in the field. Perhaps some changes in procedure had affected 
student performance. In fact, a number of changes were made in the 
administrative procedures; these are discussed by J. R. Johnson (1988) and 
summarized in Chapter 3 of this report. For example, the average number of 
students in an assessment session increased at age 17 from approximately 20 
students in 1984 to approximately 35 students in 1986. Investigation of the 
reading results by the size of the assessment session showed no reason to 
suspect that changes in session size had a substantial effect on the results 
for the 17-year-olds. Another change involved the time of assessment for 9- 
year-olds, which changed from January 2 to March 9 in 1984 to January 6 to 
January 31 in 1986. Investigation of this change did not seem to explain a 
change of the magnitude found for 9-year-olds. In addition, using field 
observations, the NAEP subcontractor Westat, Inc., (Slobasky, 1988) reviewed 
administrative procedures used in 1984 and 1986, but failed to find changes 
that were considered likely to have affected only reading for the 9- and 17- 
year-olds. 

We could not be sure, however, that seemingly minor changes in the 
design specifications and resulting procedures did not have an effect; 
therefore, we could not reject the hypotheses that administrative changes 
might have affected estimated student performance. 

Quality Control Hypotheses 

A logical possible source of the apparent decline was inaccuracy in the 
data processing. NAEP data were already subject to strict quality control 
procedures (see Beaton, 1987), but to assure independently the accuracy of the 
data, we selected a copy of each type of booklet at random and confirmed that 
student responses in the assessment booklets were accurately recorded in the 
database. A study of the database, described by Ferris (1988), showed it to 
be very accurate. An external consultant, Dr. W. B. Schrader (1988), also 
reviewed this process and found no basis for questioning the database or the 
scoring keys. Computations of proportions passing various items were done by 
several programs, and the results were in agreement. 

We therefore concluded that gross errors in the database and major 
computational errors could be ruled out as explanations of the decline. We 
could not, however, completely rule out the possibility that minor errors did 
occur in 1986 or that other errors occurred in previous years. 

Scaling Hypotheses 

NAEP uses a complex process to estimate the distribution of reading 
proficiency. We therefore investigated hypotheses that the anomalous decline 
was an artifact of the scaling process that was used to develop and equate the 
NAEP scales. 

An approximate method has been developed for estimating average 
proficiency on the reading scale from the average percentage of items that the 
students answered correctly, without any scaling of the data. This method, 
described by Mislevy (1988), shows that the decline in the average proportion 
answering items correctly is consistent with the decline in reading 
proficiency estimated from the scaling procedure. 

We therefore ruled out the scaling process as the cause of a substantial 
part of the decline in reading proficiency. 

Item-level Hypotheses 

Another set of hypotheses involved the responses to assessment items. 
Several questions were pursued about individual items. Were one or a few 
items so dramatically different that the decline is attributable to only a few 
items? Was there a change in the way that students responded to particular 
items? These hypotheses were examined by E. G. Johnson (1988c). 

In summary, there was neither one nor a few items that behaved 
differently enough from past assessments to affect the entire results. In 
general, the 17-year-old students were less likely than those of previous 
years to respond correctly to an item, more likely to respond incorrectly or 
to select "I don't know," and slightly less likely to omit or not reach items. 
These changes in the "I don't know," omitted, and not-reached rates were found 
to contribute little to the decline. The decline seemed, therefore, to be 
associated with performance in general rather than a few unusual items. 

We therefore rejected the hypotheses about one or a few aberrant items. 

Booklet and Block Hypotheses 

A number of hypotheses were developed about the item blocks, assessment 
booklets, and the context in which they were administered. We hypothesized 
that a student might respond differently to a reading exercise when the 
exercise is placed in a booklet with mathematics or science exercises. The 
effect of changing the context and position within a booklet of reading blocks 
was studied, and the results were reported by Zwick (1988a). 

The study showed that, in most cases, the context and position of the 
block had a small effect on reading performance. There was some evidence that 
reading performance was adversely affected in reading blocks that followed two 
nonreading blocks, but even when the booklets containing this mix of blocks 
were removed, the sharp decline in estimated reading proficiency remained. 

We therefore felt at the time of these analyses that the mixture of 
blocks within booklets did not contribute in a major way to the anomalous 
results. The question of the placement of items within blocks could not be 
studied without the additional data discussed in Chapter 3. 







Other Hypotheses 

When none of the preceding theories about the 1986 assessment seemed to 
give an adequate explanation for the declines in estimated reading 
proficiency, a number of other hypotheses were explored. Two examples follow. 

The external event hypothesis. We looked for some event in 1985 or 1986 
that might have affected the way the students responded to NAEP. We found 
one, the Challenger disaster, which occurred during the last week of the 
assessment of the 9-year-olds. We felt that this tragedy might have affected 
the students emotionally, thereby influencing their performance. The study of 
this hypothesis is discussed by Beaton (1988a, Chapter 12). To investigate 
it, the data for 9-year-olds were separated by day of assessment and reviewed 
for any large increase in the number of low scorers immediately after the 
Challenger disaster occurred. No substantial change in the proportion of low 
scorers was discerned. 

The hypothesis that the 1984 assessment results were unusually high. 
This hypothesis was investigated by performing comparisons of 1984 with 
earlier years. Although the 1984 average reading performance was higher at 
age 17 than in previous years, the decline in 1986 would still be substantial 
even if compared to the results of the earlier assessments. 

The examination of these hypotheses did not seem to explain the reading 
anomaly. 



* * * 







In summary, these first investigations of the reasons for the 1986 
reading anomaly were inconclusive. Although a number of possible explanations 
for the estimated decline in reading proficiency were discredited, there was 
insufficient information in the data to discredit a number of others. It was 
clearly possible that the seemingly minor changes in the assessment booklets 
or in administrative procedures may have had a sufficient effect on the 
responses of students to produce such anomalous results. We therefore 
modified the 1988 sample to collect data that could lead to a clarification of 
these issues. These changes are discussed in the next chapter. 






Chapter 3 
THE REDESIGN OF THE 1988 ASSESSMENT 



Albert E. Beaton^ 



Although the research summarized in Chapter 2 rejected beyond a 
reasonable doubt several hypotheses about the 1986 reading anomaly, several 
other hypotheses remained viable, inasmuch as sufficient information was not 
available either to confirm or to reject them. In particular, there were 
changes in the assessment booklets and administrative procedures. Although 
these changes had been believed to be minor and unlikely to have a major 
effect on student performance, there was no way to establish the magnitude of 
the effect if, indeed, any effect did exist. In order to estimate the effect, 
the design of the 1988 NAEP was modified, as described below. 

The 1988 assessment had been designed to assess performance in reading, 
writing, civics, and U.S. history. Assessments in mathematics and science were 
scheduled for 1990. As in past assessments, the design encompassed students 
enrolled in American schools, both public and private, at ages 9, 13, and 17, 
and, for some purposes, overlapping samples of fourth-, eighth-, eleventh-, 
and twelfth-grade students. The design of the entire 1988 assessment will be 
discussed in the 1988 technical report; this chapter will detail only the 
parts of the design that are relevant to investigating the reading trend 
estimates that were obtained in 1986. 

^The tables in this chapter were produced by Jo-Ling Liang. 




Before discussing the redesign of the 1988 assessment, it is important 
to understand the similarities and differences between the 1984 and 1986 
student samples that were used in estimating the reading trend. In both 1984 
and 1986, NAEP was implementing the new design (see Messick, Beaton, & Lord, 
1983) that had been proposed to improve its efficiency and usefulness. The 
1984 assessment introduced many design changes, but still more were desirable. 
The new technology was introduced in the "main" NAEP samples. In these 
samples, an important change was made in the definition of the age categories, 
and thus different samples of students had to be assessed for estimating 
trends. NAEP had traditionally defined age categories differently for ages 9 
and 13 than for age 17. For the main NAEP 1986 samples, uniform age 
definitions were used, changing the 1986 population of 9-year-olds to mostly 
third graders and the population of 13-year-olds to mostly seventh graders, 
instead of fourth and eighth graders respectively as in the past. Other 
important changes were also introduced into the main 1986 assessment. 

Because such changes would have destroyed comparability with the past 
and thus the ability to estimate trends, separate samples, called "bridge 
samples," were assessed in 1986 at ages 9 and 13. In these bridge samples, 
all age populations were defined exactly as in past assessments. 
Consequently, the new data from these bridge samples were presumed to be 
comparable to the data from past assessments. Since the age definition of 17- 
year-olds did not change in 1986, it was felt that the main NAEP sample could 
be used for the measurement of trends for that age without the addition of a 
separate sample. 

As discussed in the previous chapter, some differences did occur between 
the 1984 assessment and the portions of the 1986 assessment used for trend 
estimation, and it was presumed at the time that these differences were minor 
and would not have a noticeable effect on the assessment results. 
Similarities and differences between the 1984 and 1986 samples that were used 
for trend analysis are compared in the second and third columns of Tables 3.1 
to 3.3. These tables are adapted from a table in a chapter by J. R. Johnson 
(1988), which also contains more detailed information about the differences 
between the 1984 and 1986 assessments. 

The age definitions for trend samples in 1984 and 1986 are comparable. 
The estimation of reading trends used student populations defined by age only, 
since, before the 1984 assessment, only age populations were sampled and thus 
no long-term trend data are available by grade. 

The 1984 students used for estimating the reading trend between 1984 and 
1986 were assessed using BIB (Balanced Incomplete Block) spiraling^ at all age 
levels. The purpose of BIB spiraling is to allow us to administer a large 
pool of items without a heavy burden on any individual student while retaining 
the ability to estimate the interrelationship between each pair of items. BIB 
spiraling requires developing and administering a set of assessment booklets 
so that most students in an assessment session receive different booklets, 
although various pairings of subsets of the items appear across booklets. To 
form the booklets, the items are organized into equally timed "blocks," each 
representing a subset of the entire item pool. These blocks are permuted so 



^In 1984, two randomly equivalent samples were selected at each age level. 
In order to maintain continuity with past NAEP practices, reading and writing 
were assessed in one sample using matrix sampling and tape recorded 
administration, as in the past. In the other sample, reading and writing were 
assessed using BIB spiraling. The two measurement systems were equated. The 
1984 reading scale means were obtained by weighting the BIB and paced means in 
inverse proportion to their squared standard errors and then summing. 
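
The weighting described in this note, in which each mean is weighted in 
inverse proportion to its squared standard error, can be sketched as follows. 
The means and standard errors below are hypothetical and are not the 1984 
values. The combined standard error returned by the function additionally 
assumes that the two samples are independent, which the note does not state. 

```python
def precision_weighted_mean(means, standard_errors):
    """Combine estimates with weights proportional to 1 / SE squared."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    combined = sum(w * m for w, m in zip(weights, means)) / total
    # Standard error of the combined mean, assuming independent samples.
    combined_se = (1.0 / total) ** 0.5
    return combined, combined_se


# Hypothetical BIB and paced means with their standard errors.
combined, combined_se = precision_weighted_mean(
    means=[212.0, 210.0],
    standard_errors=[1.0, 2.0],
)

print(round(combined, 1), round(combined_se, 3))
```

The more precise estimate dominates the result: with these assumed inputs the 
first mean receives four times the weight of the second. 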




Table 3.1 

Comparison of Data Used to Measure Reading Trend, Age 9 

Characteristic            1984 Sample               1986 Sample               1988 Bridge to 1984       1988 Bridge to 1986
                          Used for Trend            Used for Trend

Description of Sample     Age subsample of main     Age-only sample           Age subsample of bridge   Age-only sample
                          NAEP sample                                         data set

Modal Grade               4                         4                         4                         4

Curriculum Areas          Reading, writing          Reading, mathematics,     Reading, writing          Reading, mathematics,
                                                    science                                             science

Sample Size (the number   16,799                    6,932                     3,782                     3,711
of students with reading
scale values)

Age Definition            Calendar year             Calendar year             Calendar year             Calendar year
                          Jan.-Dec. 1974            Jan.-Dec. 1976            Jan.-Dec. 1978            Jan.-Dec. 1978

Method of Assessment      Printed                   Mathematics and science:  Printed                   Mathematics and science:
                                                    paced audiotape;                                    paced audiotape;
                                                    Reading: printed                                    Reading: printed

Dates Assessed            Winter 1984               Winter 1986               Winter 1988               Winter 1988
                          1/2 - 3/19                1/6 - 1/31                1/4 - 3/11                1/4 - 3/11

Time: common background   Approximately 15 minutes  Approximately 15 minutes  Approximately 15 minutes  Approximately 15 minutes
block                     (questions were read      (questions were read      (questions were read      (questions were read
                          aloud to students)(a)     aloud to students)        aloud to students)        aloud to students)

Time: cognitive block     14 minutes; 28 minutes    13 minutes                14 minutes; 28 minutes    13 minutes
                          for each of the                                     for the double-length
                          double-length blocks                                block V (reading and
                          (U, V, H)                                           writing)

Number of Reading Blocks  12                        3                         10                        3
Administered

Booklet Printing and      Blue ink, saddle-stitched Blue ink(b), stapled      Blue ink, saddle-stitched Blue ink, stapled
Binding

Response Mode             Circle letter             Fill in oval              Circle letter             Fill in oval

Scoring Method            Key-entered               Machine-scanned           Key-entered               Machine-scanned

Teacher Questionnaire     Language arts teacher     None                      Language arts teacher     None
                          was identified by                                   was identified by
                          students                                            students(c)

(a) In the first three weeks of the assessment, six minutes were allowed for students to complete the background 
questions. Because students did not understand the questions, background items were read to the students for the 
remainder of the assessment. 

(b) Slightly smaller type was used in 1986. Average line length was less than five inches in 1984 reading passages 
and over five inches in 1986 reading passages. The 1988 bridge booklets were duplicated from the corresponding assessment 
years. 

(c) No teacher data were collected for this sample. 






Table 3.2 

Comparison of Data Used to Measure Reading Trend, Age 13 

Characteristic            1984 Sample               1986 Sample               1988 Bridge to 1984       1988 Bridge to 1986
                          Used for Trend            Used for Trend

Description of Sample     Age subsample of main     Age-only sample           Age subsample of bridge   Age-only sample
                          NAEP sample                                         data set

Modal Grade               8                         8                         8                         8

Curriculum Areas          Reading, writing          Reading(a), mathematics,  Reading, writing          Reading, mathematics,
                                                    science                                             science

Sample Size (the number   17,535                    6,200                     4,005                     3,942
of students with reading
scale values)

Age Definition            Calendar year             Calendar year             Calendar year             Calendar year
                          Jan.-Dec. 1970            Jan.-Dec. 1972            Jan.-Dec. 1974            Jan.-Dec. 1974

Method of Assessment      Printed                   Mathematics and science:  Printed                   Mathematics and science:
                                                    paced audiotape;                                    paced audiotape;
                                                    Reading: printed                                    Reading: printed

Dates Assessed            Fall 1983                 Fall 1985                 Fall 1987                 Fall 1987
                          10/10 - 12/17             11/4 - 12/13              10/12 - 12/18             10/12 - 12/18

Time: common background   6 minutes                 6 minutes                 6 minutes                 6 minutes
block

Time: cognitive block     14 minutes; 28 minutes    16 minutes                14 minutes (no            16 minutes
                          for each of the                                     double-length block in
                          double-length blocks                                these booklets)
                          (U, V, H)

Number of Reading Blocks  1[illegible]              3                         10                        3
Administered

Booklet Printing and      Brown ink,                Blue ink(b), stapled      Brown ink,                Blue ink, stapled
Binding                   saddle-stitched                                     saddle-stitched

Response Mode             Circle letter             Fill in oval              Circle letter             Fill in oval

Scoring Method            Key-entered               Machine-scanned           Key-entered               Machine-scanned

Teacher Questionnaire     Language arts teacher     None                      Language arts teacher     None
                          was identified by                                   was identified by
                          students                                            students(c)

(a) The format and content of the 1986 age 13 reading blocks were identical to those used at age 17. 

(b) Slightly smaller type was used in 1986. Average line length was less than five inches in 1984 reading passages 
and over five inches in 1986 reading passages. The 1988 bridge booklets were duplicated from the corresponding assessment 
years. 

(c) No teacher data were collected for this sample. 






Table 3.3

Comparison of Data Used to Measure Reading Trend, Age 17

| Characteristics | 1984 Sample Used for Trend | 1986 Sample Used for Trend | 1988 Bridge to 1984 | 1988 Bridge to 1986 |
|---|---|---|---|---|
| Description of Sample | Age subsample of main NAEP sample | Age subsample of main NAEP sample | Age subsample of bridge data set | Age subsample of bridge data set |
| Modal Grade | 11 | 11 | 11 | 11 |
| Curriculum Areas | Reading, writing | Reading, mathematics, science, computer competence, history[a], literature[a] | Reading, writing | Reading, mathematics, science, history |
| Sample Size (the number of students with reading scale values) | 18,984 | 16,418 | 3,652 | 3,715 |
| Age Definition | Oct. 1966-Sept. 1967 | Oct. 1968-Sept. 1969 | Oct. 1970-Sept. 1971 | Oct. 1970-Sept. 1971 |
| Method of Assessment | Printed | Printed | Printed | Printed |
| Dates Assessed | Spring 1984, 3/12 - 5/11 | Spring 1986, 2/17 - 5/2 | Spring 1988, 3/14 - 5/13 | Spring 1988, 3/14 - 5/13 |
| Time: Common background block | 6 minutes | 6 minutes | 6 minutes | 6 minutes |
| Time: Cognitive block | 14 minutes; 28 minutes for each of the double-length blocks (U, V, W) | 16 minutes | 14 minutes (no double-length block in these booklets) | 16 minutes |
| Number of Reading Blocks Administered | 12 | 6 | 10 | |
| Average Session Size | Approximately 20 | Approximately 35 | Approximately 20 | Approximately 35 |
| Booklet Printing and Binding | Black ink, saddle-stitched | Blue ink[b], stapled | Black ink, saddle-stitched | Blue ink, stapled |
| Response Mode | Circle letter | Fill in oval | Circle letter | Fill in oval |
| Scoring Method | Key-entered | Machine-scanned | Key-entered | Machine-scanned |
| Teacher Questionnaire | Language arts teacher was identified by students | Up to 5 teachers were identified by students | Language arts teacher was identified by students[c] | Up to 5 teachers were identified by students[c] |

[a] Four of the 97 booklets at age 17 contained one history block, one literature block, and one reading block (13RA).

[c] No teacher data were collected for this sample.




that each block appears paired with each other block in some booklet. In both 
1984 and 1986, each student received a common block containing background 
questions and three subject matter blocks containing assessment exercises in a 
specific subject area and a small number of background and attitude questions. 
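
The pairing property described above (every block appearing together with every other block in at least one booklet) can be checked mechanically. The sketch below is illustrative only: the seven-block, seven-booklet layout is a hypothetical textbook balanced incomplete block design, not the actual NAEP booklet map, and the function names are assumptions introduced for this example.

```python
from itertools import combinations

# Hypothetical toy design: 7 blocks (A-G) in 7 booklets of 3 blocks each.
# This is the classic (7, 3, 1) balanced incomplete block design and is
# used purely for illustration; it is NOT the actual NAEP booklet layout.
BOOKLETS = [
    ("A", "B", "C"),
    ("A", "D", "E"),
    ("A", "F", "G"),
    ("B", "D", "F"),
    ("B", "E", "G"),
    ("C", "D", "G"),
    ("C", "E", "F"),
]


def pair_counts(booklets):
    """Count how many booklets contain each unordered pair of blocks."""
    counts = {}
    for booklet in booklets:
        for pair in combinations(sorted(booklet), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def every_pair_appears(booklets):
    """True if each block is paired with each other block in some booklet."""
    blocks = sorted({block for booklet in booklets for block in booklet})
    counts = pair_counts(booklets)
    return all(counts.get(pair, 0) >= 1 for pair in combinations(blocks, 2))


print(every_pair_appears(BOOKLETS))
```

Dropping any one booklet from this toy design leaves some pair of blocks that never appears together, which is the kind of gap the balanced design is meant to rule out.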

The implementation of the BIB spiraling differed somewhat between 1984 
and 1986. In 1984, both reading and writing were assessed. Accordingly, 
students received some combination of reading and writing blocks. Some 
students in an assessment session received three reading blocks, others two 
reading and one writing block, others one reading and two writing blocks, and 
still others received three writing blocks with no reading blocks at all. The 
1986 design called for estimating trends in reading, mathematics, and science. 
Therefore, students in the samples intended for measuring trends were 
administered booklets that contained items from these three areas. 
Consequently, although many items were administered in both the 1984 and 1986 
assessments, the context in which they were administered differed; the items 
were arranged differently within blocks and reading was administered with 
subject areas other than writing. 

Another possibly important difference between 1984 and 1986 was the
timing. In the 1986 assessment, an effort was made to increase the pool of
items that could be administered. Accordingly, the time allowed to complete a
subject area block was increased at ages 13 and 17 from 14 to 16 minutes. In
order to improve the responses to background questions and to minimize the
fatigue of the 9-year-olds, the blocks for the 9-year-olds were reduced in
length to 13 minutes. To make these changes, items were rearranged and new
blocks were formed. The number of items within a block was altered to allow
the student about the same amount of time per item in the two assessments.
Estimating item response time, however, is only approximate, and, as will be
shown later, the number of students reaching the last items in the blocks was
reduced. Since the reading items within the blocks were rearranged, an item
that was near the beginning of a block in 1984 and reached by nearly all
students might be near the end of a block in 1986 and not reached by a large
proportion of students.

A number of other well-intentioned changes were also incorporated into
the 1986 assessment. For example, to speed up the reporting process,
machine-scorable books, which had been used in assessments prior to 1984, were
reinstated. The format of the assessment books was made more pleasing to the
eye. The number of 17-year-olds in an assessment session was increased in
order to reduce the burden of several sessions on the participating high
schools. The time of year in which data were collected from the 1986 sample
of 9-year-olds was restricted for operational efficiency. A special study of
language minority students was also administered along with the 1986
assessment.

Since the 1984 and 1986 samples were measured somewhat differently,
changes in student performance are confounded with changes in measurement
procedure. The 1986 reading anomaly made it no longer tolerable to assume
that the change due to measurement procedure was small enough to be ignored.
Unfortunately, without further information, the effect of the change in the
measurement procedure could not be estimated and removed from the trend
estimates on the basis of the 1984 and 1986 data alone. Therefore, new data
had to be collected to estimate the effect of the changes in the measurement
procedure.



* * * 






In order to report the trend results as quickly as possible, appropriate
data were needed to distinguish between differences resulting from student
performance and differences resulting from measurement improvements. To
measure the effect of each of the several 1986 improvements in measurement
procedure, as well as the interactions among them, would require a very
complex research design and then a very large data collection effort, an
effort so large that the 1988 reporting would also be likely to be delayed.
Instead, the 1988 assessment design was enlarged in such a way that the net
effect of all changes in measurement procedure could be estimated, although
the effect of each individual change could not. Using this net effect, it is
possible to study the overall effect of measurement changes on the 1984 and
1986 reading proficiency data.

The general strategy for the redesign was to collect two samples of data
from the population of students at each age level. One of the samples at each
age level would be measured using the 1984 booklets and procedures and the
other using the booklets and procedures of 1986. Although the data would be
collected in the 1988 assessment, the measurement systems of the 1984 and 1986
trend assessments would be duplicated as closely as possible. Since the pairs
of samples were to be selected from the same 1988 populations, their estimated
distributions of reading proficiency should be identical, except for sampling
error. If the estimated distributions differed by more than could reasonably
be expected from the sampling process, then the differences in the estimated
distributions could be attributed to changes in the measuring systems.






The revised design included the following samples for use in 
distinguishing between changes in performance and measurement: 

The bridge-to-1984 samples. The 1988 assessment had been designed to
assess reading, writing, civics, and U.S. history. The design already
included special bridge samples for estimating trends in reading and writing
at all age levels because the 1988 NAEP overall design also introduced several
changes from past assessment practices. The decision had already been made to
bridge reading and writing performance back to 1984, since there had been a
full assessment in both these subjects in that year. The unusual and
unexpected results in the 1986 reading assessment resulted in redoubling the
effort to make the measurement in the bridge to 1984 as close as possible to
an exact re-creation of the 1984 measurement. The students in these samples
were given copies of reprints of selected 1984 booklets, and the assessment
was administered using the 1984 administration procedures, including the same
block timings. The measurement difference between this 1988 sample and the
1984 sample was thus minimized.

The bridge-to-1986 samples. In order to duplicate the 1986 methodology,
the 1988 design was modified by adding an additional sample at each age level.
These samples were selected from the same age populations that were used in
the previous trend analyses and, of course, the same populations as in the
samples from the bridge to 1984. Selected booklets that were used by the 1986
trend samples were administered to these 1988 samples, duplicating as closely
as possible the 1986 administrative procedures, including the timing.






Although all age populations are defined in exactly the same way for the
bridge-to-1986 sample as for the bridge-to-1984 sample, the measurement system
for the bridge-to-1986 sample differs at different age levels whereas the
measurement system for the bridge-to-1984 sample does not. Duplicating the
differences in the 1986 samples used for estimating trends, the assessment
procedures for the bridge to 1986 at ages 9 and 13 differed from the
procedures at age 17.

At ages 9 and 13, the trend samples in 1986 were given reading items
along with mathematics and science items. The reading data were to be
compared to the BIB spiraled data collected in 1984. In the previous
assessments with which the 1986 mathematics and science data were to be
compared, the measurement had been administered using a tape recorder to pace
students uniformly through the assessment items. In 1986, the same samples of
students were used for estimating trends in reading, mathematics, and science.
To accommodate the differences in procedure, the trend data in 1986 were
collected using a pseudo-BIB design that attempted to blend the BIB spiraling
of 1984 with the paced administration of previous assessments.

To do this, each student was administered one block of items from each
subject area. The mathematics and science items were individually paced using
a tape recorder, as in past mathematics and science assessments. The recorder
was turned off when the reading block was administered, and the reading block
was timed as a single unit. Each of the trend reading blocks was, therefore,
administered as a single unit in a manner similar to the 1984 assessment, but
the tape recorder was turned on for the blocks of mathematics and science
items. Since selected 1986 booklets were administered in the same way to the
1988 bridge-to-1986 samples at ages 9 and 13, the result was that some
mathematics and science data were collected in 1988 as a byproduct of the
reading anomaly study, although such data were not part of the original
assessment design.

Since the definition of 17-year-olds did not change in the main part of
the 1986 assessment, the definition of the 17-year-old population remained
comparable to all previous assessments, and so the pseudo-BIB design was
deemed unnecessary at this age level. The 1986 reading trend for 17-year-olds
was based on the main NAEP sample, which was BIB-spiraled as in 1984.
Students in this sample were administered some combination of reading,
mathematics, science, computer competence, U.S. history, and literature
blocks. (The estimates of trends in mathematics and science were based on
separate samples that were paced through the items using a tape recorder, as
in their comparison samples; there were no estimates of trends in computer
competence, U.S. history, or literature since these were newly developed.)
Therefore, some of the BIB-spiraled booklets from 1986 containing reading
blocks were selected for administration in 1988 to the bridge-to-1986 sample,
and the 1986 procedures were duplicated as closely as possible. Since
reprinting exact 1986 booklets was required, and most of the 1986 BIB booklets
containing reading blocks also contained mathematics and science blocks, the
1988 bridge-to-1986 sample also includes samples of 17-year-olds who were
assessed in portions of the 1986 mathematics and science materials.

* * * 

To summarize, the analyses in this report are based primarily on four
samples of data at each of the three age levels. The samples are:

• the 1984 main NAEP sample, collected during the 1984 assessment;

• the 1986 reading trend sample, collected during the 1986
assessment;

• the bridge-to-1984 sample, collected during the 1988 assessment;
and

• the bridge-to-1986 sample, collected during the 1988 assessment.

The properties of these samples have been summarized in Tables 3.1, 3.2, and
3.3.

Comparing the bridge-to-1984 and the bridge-to-1986 samples is of
methodological interest, since both samples were drawn from the same student
populations, at the same time, and are thus identical in principle in reading
proficiency. Any differences in estimated reading proficiency must,
therefore, be attributable to the differences introduced by changing the
measurement procedures and to those inherent in random sampling. The major
part of estimating the effect of measurement procedures is comparing the
estimates from the two randomly equivalent bridge samples.

However, it should be noted that exact duplication of procedure is
impossible in practice and a few compromises had to be made. For example,
since it was considered important to have the two 1988 bridge samples
comparable to each other, the bridge-to-1986 sample for 9-year-olds was
assessed between January 4 and March 11, 1988, although, as noted above, the
age 9 trend assessment in 1986 occurred in January only. Also, at age 17, it
was not feasible to assess the bridge-to-1986 students in sessions as large as
those in 1986. However, earlier research had shown that the number of
students in an assessment session did not have a substantial effect on
performance at age 17.






Under the assumption that these assessment forms measured reading in
the same way in 1988 as when the identical forms were last used, the
comparison of the bridge-to-1984 data with the actual 1984 data is of
substantive interest, since it estimates the trend in reading proficiency in
the metric of the 1984 assessment technology. Likewise, the comparison of the
bridge-to-1986 data with the actual 1986 data is of substantive interest,
since it estimates the trend in reading proficiency from 1986 to 1988 in the
metric of the 1986 assessment technology. The next chapter will give an
overview of the results from these comparisons.






Chapter 4 
OVERVIEW OF RESULTS 



Albert E. Beaton 

Rebecca Zwick 
Kentaro Yamamoto[1]



The data collected in 1988 from the augmented NAEP design, which was
described in the last chapter, have now been analyzed. With these data, the
ETS/NAEP staff continued its research into explaining the 1986 reading anomaly
and obtained improved estimates of reading performance in 1986. The improved
estimates were shown in Figure 1.2 of Chapter 1. The research has led us to
conclude that the changes in the reading assessment instruments and
administrative procedures that were introduced between 1984 and 1986 had a
major effect on the 1986 estimates of reading proficiency. A summary of this
research is shown in the next section of this chapter, which describes the
effect of changes in measurement procedures. The following section summarizes
how the data collected during the 1988 assessment were used to improve the
1986 estimates of reading proficiency.

It should be noted that all survey results are subject to error, and
NAEP uses the best available technology to estimate the standard errors for
the statistics that it publishes. As assessment technology matures, and as
new insights into the application of existing technology appear, there is an
opportunity to improve the estimated values of past and present surveys. In
addition to adjusting the 1986 results for the effects of changes in item
context and administration procedures, we have taken the opportunity to
improve estimates of student performance wherever possible, although the sizes
of the changes were trivially small. The next section of this chapter
summarizes all of the improvements in reading proficiency estimates that were
made between the publication of The Reading Report Card: Progress Toward
Excellence in Our Schools (1985) and this report.

[1] The figures in this chapter were produced by Jo-Ling Liang and David
Freund.

For completeness, the results of the analyses of newly available data on
mathematics and science proficiency are presented. These data were collected
as a byproduct of the modification of the NAEP design to include reading
samples that were measured in the same way as in 1986. Although these data do
not meet the usual standards of a full NAEP assessment for a subject area,
they were analyzed in hopes of generating alternate hypotheses for the reading
anomaly, but did not seem to do so.

Finally, this chapter presents its conclusions and a discussion of 
continuing research. 

The Effect of Changes in Measurement Procedures

The redesigned 1988 assessment permitted an estimate of the effects of
the changes in the measurement instruments and administrative procedures
between 1984 and 1986. At each grade level, two randomly equivalent samples
of students were assessed, the assessment of the bridge-to-1984 sample
duplicating as closely as possible the 1984 assessment system and that of the
bridge-to-1986 sample duplicating the 1986 methodology. Since both sets of
samples came from identical populations, the population distributions of
reading proficiency are in principle the same for the pairs of samples at each
age level. If there were no differential effect due to measurement procedure,
sample estimates of these distributions would be the same, except for sampling
error; since the variance of the sampling error is estimable, any differences
between the samples that are excessive in light of the sampling error must be
due to the measurement process. Thus, the effects due to changes in the
measurement process could be estimated by comparing the estimated
distributions of reading proficiency for the pairs of samples.

The comparisons between the estimates of the distributions of reading
proficiency for the pairs of age-equivalent samples showed substantial
differences. Figure 4.1 shows these results graphically. The solid lines
show the estimated trend at each age level from 1971 to 1988, omitting the
point for 1986, since it differed in measurement procedure. These trend
lines[2] are what we would have estimated if there had been no assessment in
1986 and thus no anomalous 1986 data.

The dotted lines show where the estimated trend from 1986 to 1988 would
have been (1) if the reading anomaly had been ignored, (2) if the unmodified
1986 data had been used for trend estimation, and (3) if the data from the
1988 bridge-to-1986 sample had been used to estimate reading proficiency in
1988.



[2] The trend line in Figure 4.1 contains the same estimates for 1971 to 1984
as Figure 1.2. The 1988 estimates in Figure 4.1 used a conditioning model to
maximize the comparability between the two 1988 bridge samples; in Table 4.1 and
Figure 1.2, the conditioning model maximized comparability between the 1984 and
bridge-to-1984 data. The differences between the two sets of estimates are less
than 0.3 points for age-level means. The figures in Figure 1.2 are used in the
most recent reading trend report (Mullis & Jenkins, 1990). See Footnote 4 in
Chapter 5 of the present report.




Figure 4.1

Reading Scale Results
1971 - 1988*

* Standard errors of means are approximately 1.0. Bands extend from two standard
errors below to two standard errors above the mean. Appendix A (p. 171) gives a summary of
which modifications of reading scale results are used in the tables and figures in this
report.

Weighted Reading Proficiency Means and Standard Errors

| Year | Age 9 (S.E.) | Age 13 (S.E.) | Age 17 (S.E.) |
|---|---|---|---|
| 1971 | 207.3 (1.0) | 255.2 (0.9) | 285.4 (1.2) |
| 1975 | 210.2 (0.7) | 256.0 (0.8) | 286.1 (0.8) |
| 1980 | 214.8 (1.1) | 258.5 (0.9) | 285.8 (1.4) |
| 1984 | 211.0 (1.0) | 257.1 (0.7) | 288.8 (0.9) |
| 1986 | 208.9 (1.2) | 259.4 (1.0) | 277.4 (1.1)** |
| 1988 Br. to 84 | 212.1 (1.1) | 257.5 (0.9) | 289.9 (1.3) |
| 1988 Br. to 86 | 214.0 (1.0) | 263.7 (0.8) | 281.9 (1.4) |

** Standard error differs from column 4 of Table 4.1 because of a change in
jackknife methodology.






In Figure 4.1, there are two points in 1988 at each age level, one
representing the estimate made from the bridge-to-1984 sample and the other
from the bridge-to-1986 sample, and the differences between the pairs of
points are estimates of the differences attributable to the changes in
measurement and administrative procedures. The graph shows these differences
in the context of the other changes that have been observed since NAEP began
measuring reading trends. The estimated effects differed by age level:



• At age 9, the estimated average reading proficiency score was
slightly higher (1.9 points) for the 1988 bridge to 1986 than for
the bridge to 1984.

• At age 13, the estimated average reading proficiency was
noticeably higher (6.2 points) for the 1988 bridge to 1986 than
for the bridge to 1984.

• At age 17, the estimated average reading proficiency was
substantially lower (8.0 points) for the 1988 bridge to 1986 than
for the bridge to 1984.



Except for the 9-year-old students, the differences due to changes in
measurement procedure are larger than the changes in reading proficiency since
NAEP first assessed reading.
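
As a rough illustration of how these differences compare with sampling error, the sketch below recomputes each bridge-sample difference and expresses it in units of its standard error. It is a simplified sketch, not the report's own procedure: the means and standard errors are transcribed from the table under Figure 4.1, the two 1988 samples are assumed to be independent so that their error variances add, and the function name `compare` is introduced only for this example.

```python
import math

# Weighted means and standard errors for the two 1988 bridge samples,
# transcribed from the table under Figure 4.1 as (mean, standard error).
BRIDGE_TO_1984 = {9: (212.1, 1.1), 13: (257.5, 0.9), 17: (289.9, 1.3)}
BRIDGE_TO_1986 = {9: (214.0, 1.0), 13: (263.7, 0.8), 17: (281.9, 1.4)}


def compare(age):
    """Return (difference, SE of difference, difference in SE units).

    Assumption: the two samples are independent, so the variance of the
    difference is the sum of the two squared standard errors.
    """
    mean_84, se_84 = BRIDGE_TO_1984[age]
    mean_86, se_86 = BRIDGE_TO_1986[age]
    diff = mean_86 - mean_84
    se_diff = math.sqrt(se_84 ** 2 + se_86 ** 2)
    return diff, se_diff, diff / se_diff


for age in (9, 13, 17):
    diff, se_diff, ratio = compare(age)
    print(f"age {age}: difference {diff:+.1f}, SE {se_diff:.2f}, ratio {ratio:+.1f}")
```

Under these assumptions the age 9 difference is a little over one standard error, while the age 13 and age 17 differences are several standard errors in size, which is consistent with the pattern described above.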

Not only was the average reading proficiency affected by the changes in
measurement procedure, but the variance was also affected. The estimated
distributions of reading proficiency from the 1988 bridge-to-1986 data had
larger variances at ages 13 and 17 than the variances estimated from the
bridge-to-1984 samples. The distributions of reading proficiency estimated
from the 1988 bridge-to-1984 and bridge-to-1986 samples are shown in Figures
4.2, 4.3, and 4.4. Those distributions were estimated before the
common-population equating discussed in the next section. The estimated
distributions are reasonably similar at age 9, show the tendency for higher
scores and increased variance at age 13, and the vastly increased percentage
of lower scores along with a slight increase in high scores for the
17-year-olds.

Figure 4.2

Estimated Reading Proficiency Distributions for 1988 Bridge Samples, Age 9

Figure 4.3

Estimated Reading Proficiency Distributions for 1988 Bridge Samples, Age 13

Figure 4.4

Estimated Reading Proficiency Distributions for 1988 Bridge Samples, Age 17

[Figures 4.2 to 4.4. Legend: solid line, bridge to 1984; dashed line, bridge to 1986. Horizontal axis: Reading Proficiency Scale.]

Clearly, the differences due to measurement changes are substantial and
unacceptable. The measurement changes are also reflected in the responses to
specific items; the differences in the proportion of students correctly
answering individual items showed similar results. The differences are
present in the basic item data and are not, therefore, attributable to the IRT
scaling technology.

Reestimating Reading Proficiency in 1986

The original procedure for equating the scales developed in 1984 and
1986 rested on the assumption that an item would function in the same way in
different contexts. The availability of common populations made it possible
to do another type of equating between the two different measurement
procedures. Since the two 1988 bridge samples represented randomly equivalent
populations, it was possible to use common-population equating methods,
equating the distributions of proficiency without reliance on the consistency
of parameters for the common items. In this way, the item parameters of the
items common to both samples were allowed to vary as appropriate for the
different contexts. This alternate approach resulted in a satisfactory
equating of the bridge-to-1986 samples to the bridge-to-1984 samples.
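
The idea of equating two score distributions obtained from randomly equivalent samples can be illustrated with the simplest possible case, a linear transformation that matches the mean and standard deviation of one sample to those of the other. This is only a sketch under stated assumptions: the report describes a nonlinear transformation (Chapter 6), the score lists below are hypothetical values invented for illustration, sampling weights are ignored, and `linear_equating` is not a function from any NAEP software.

```python
import statistics


def linear_equating(new_form_scores, reference_scores):
    """Return a function mapping new-form scores onto the reference metric.

    The mapping is chosen so that the transformed new-form scores have the
    same mean and standard deviation as the reference scores. This is a
    simplified stand-in for the nonlinear equating used in the report.
    """
    mu_new = statistics.fmean(new_form_scores)
    sd_new = statistics.pstdev(new_form_scores)
    mu_ref = statistics.fmean(reference_scores)
    sd_ref = statistics.pstdev(reference_scores)
    slope = sd_ref / sd_new
    intercept = mu_ref - slope * mu_new

    def to_reference_metric(score):
        return slope * score + intercept

    return to_reference_metric


# Hypothetical proficiency values for two randomly equivalent samples.
bridge_to_1986_scores = [230.0, 250.0, 262.0, 270.0, 290.0, 310.0]
bridge_to_1984_scores = [235.0, 248.0, 256.0, 260.0, 272.0, 285.0]

convert = linear_equating(bridge_to_1986_scores, bridge_to_1984_scores)
equated = [convert(score) for score in bridge_to_1986_scores]
print(round(statistics.fmean(equated), 1), round(statistics.pstdev(equated), 1))
```

Because the two samples come from the same population, any transformation that aligns their distributions can then be applied to scores collected with the same instrument in another year, which is how the 1986 results are placed on the 1984 metric.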

Thus, the relationship between the reading proficiency measurements from
the 1984 and 1986 assessment technologies was developed by equating the
results from the two randomly equivalent 1988 samples, the bridge to 1984 and
the bridge to 1986. Assuming that the relationship between the 1984 and 1986
forms had not changed, the 1986 reading results were transformed into the
metric of the 1984 assessment. The 1986 results transformed into the 1984
metric were used as the modified estimates of reading proficiency in 1986.
The processes of equating and transformation between the two assessment forms
are presented in Chapter 6.

Although the more precise equating procedures described in Chapter 6
were used, the general concept of transforming the 1986 reading proficiency
estimates into the metric of the 1984 assessment can be thought of more simply
in graphic terms by returning to Figure 4.1. We know a priori that the pairs
of points in 1988 are from identical populations of students, which are thus
identical in reading proficiency, except for sampling error. Assuming that
the average sampling error is close enough to zero to ignore, the observed
differences between the pairs of points are due to the measurement methodology.
Although a nonlinear transformation[3] was actually used, moving the 1988
bridge-to-1986 points linearly so that they coincide with the equivalent
bridge-to-1984 points is a simple way to adjust for the effect of measurement
technology. Moving the actual 1986 points by the same amounts approximately
transforms them into the 1984 reading metric.
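
The linear shift described in this paragraph can be worked through with the means listed under Figure 4.1. The sketch below is an approximation only: it moves each 1986 mean by the amount needed to bring the bridge-to-1986 mean onto the bridge-to-1984 mean, whereas the report's modified estimates come from a nonlinear transformation and therefore differ somewhat. The dictionary and function names are introduced for this example.

```python
# Means transcribed from the table under Figure 4.1, keyed by age.
MEANS_1986 = {9: 208.9, 13: 259.4, 17: 277.4}
BRIDGE_TO_1984 = {9: 212.1, 13: 257.5, 17: 289.9}
BRIDGE_TO_1986 = {9: 214.0, 13: 263.7, 17: 281.9}


def linear_shift_adjust(age):
    """Shift the 1986 mean by the 1988 bridge-sample difference.

    The shift is (bridge-to-1984 mean) - (bridge-to-1986 mean), i.e. the
    amount that would make the two 1988 points coincide.
    """
    shift = BRIDGE_TO_1984[age] - BRIDGE_TO_1986[age]
    return round(MEANS_1986[age] + shift, 1)


for age in (9, 13, 17):
    print(age, linear_shift_adjust(age))
```

With these inputs the shifted 1986 means are 207.0, 253.2, and 285.4 at ages 9, 13, and 17. They are near, but not equal to, the modified estimates reported in Table 4.1, as expected for a linear approximation to a nonlinear adjustment.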

Modifications of the Reading Trend Lines

There has been a major improvement in the reading proficiency estimates
for 1986, which was discussed in the last section, and three minor
improvements in the general estimates of trend, which resulted from other
improvements in assessment technology. The effects of these improvements on
the reading trend lines are presented in Table 4.1. These improvements are
summarized here in the same order as presented in Table 4.1 and discussed in
more detail below.

• a minor improvement by increasing the number of variables used in
the conditioning process, which affected the reading trend
estimates for the 1971 to 1986 assessments at all age levels

• a major improvement in the 1986 estimates of reading performance
at all three age levels due to adjusting for the effect of the
changes in measurement instruments and administrative procedure

• a minor improvement in the 1984 estimates of reading performance
at ages 9 and 13 that resulted from a reanalysis of the 1984
sampling weights

• a minor improvement in the 1986 estimates of reading performance
at ages 9 and 13 resulting from a change in the conditioning model

[3] See Figures 6.8 to 6.10 in Chapter 6.

Table 4.1

Effects of Various Changes on Trend Values

Age 9

| Year | Reading Report Card Estimate[a] | Addition of Conditioning Variables, 1985[b] | Technical Report on Reading Anomaly Estimate[c] | Change Resulting from Context Adjustment | Change Resulting from Weights Adjustment | Change Resulting from Change in Conditioning Model | Modified Estimate |
|---|---|---|---|---|---|---|---|
| 1971 | 207.2 (1.1) | 0.1 | 207.3 (1.0) | | | | 207.3 (1.0) |
| 1975 | 209.6 (0.7) | 0.6 | 210.2 (0.7) | | | | 210.2 (0.7) |
| 1980 | 213.5 (1.1) | 1.3 | 214.8 (1.1) | | | | 214.8 (1.1) |
| 1984 | 213.2 (0.9) | -0.3 | 212.9 (1.0) | | -1.9 | | 211.0 (1.0) |
| 1986 | | | 207.3 (1.4) | -0.0 | | +1.6 | 208.6 (1.9) |
| 1988 | | | | | | | 211.8 (1.2) |

Age 13

| Year | Reading Report Card Estimate[a] | Addition of Conditioning Variables, 1985[b] | Technical Report on Reading Anomaly Estimate[c] | Change Resulting from Context Adjustment | Change Resulting from Weights Adjustment | Change Resulting from Change in Conditioning Model | Modified Estimate |
|---|---|---|---|---|---|---|---|
| 1971 | 253.9 (1.1) | 1.3 | 255.2 (0.9) | | | | 255.2 (0.9) |
| 1975 | 254.8 (0.8) | 1.2 | 256.0 (0.8) | | | | 256.0 (0.8) |
| 1980 | 257.4 (0.9) | 1.1 | 258.5 (0.9) | | | | 258.5 (0.9) |
| 1984 | 257.8 (0.6) | 0.2 | 258.0 (0.7) | | -0.9 | | 257.1 (0.7) |
| 1986 | | | 260.4 (1.1) | -4.4 | | -1.0 | 255.0 (1.6) |
| 1988 | | | | | | | 257.5 (0.9) |

Age 17

| Year | Reading Report Card Estimate[a] | Addition of Conditioning Variables, 1985[b] | Technical Report on Reading Anomaly Estimate[c] | Change Resulting from Context Adjustment | Change Resulting from Weights Adjustment | Change Resulting from Change in Conditioning Model | Modified Estimate |
|---|---|---|---|---|---|---|---|
| 1971 | 284.3 (1.2) | 1.1 | 285.4 (1.2) | | | | 285.4 (1.2) |
| 1975 | 284.5 (0.7) | 1.6 | 286.1 (0.8) | | | | 286.1 (0.8) |
| 1980 | 284.5 (1.1) | 1.3 | 285.8 (1.4) | | | | 285.8 (1.4) |
| 1984 | 288.2 (0.9) | 0.6 | 288.8 (0.9) | | | | 288.8 (0.9) |
| 1986 | | | 277.4 (1.0) | +8.6 | | | 286.0 (1.7) |
| 1988 | | | | | | | 290.1 (1.1) |

[a] From The Reading Report Card (1985, p. 65).

[b] Additional conditioning variables were added in 1985.

[c] From The NAEP 1985-86 Reading Anomaly: A Technical Report (Beaton, 1988a, p. 7).



Table 4.1 is presented in three parts, one for each of the age 
populations assessed. The first column of each part of this table is the year 
in which the assessment took place. 

The second column is the former trend estimate, which is taken from The
Reading Report Card: Progress Toward Excellence in Our Schools (1985). For
each age population, the estimated average reading performance is recorded
for each year that reading was assessed up through 1984. The estimated
standard error for the average is given in parentheses.

The third column of the table contains the effects of adding 
conditioning variables to the psychometric model. Conditioning is a process 
by which estimates of proficiency distributions can be improved by 
incorporating student background variables as well as item responses, assuring 
consistent estimates of population parameters if the conditioning model is 
accurate. The conditioning process, which comprises one phase of proficiency 
estimation, is described by Mislevy (1988) and its application in NAEP is
described in other chapters in the 1986 NAEP technical report (Beaton, 1988b).
When the 1984 data were analyzed, the computer program limited the number of 
variables that could be conditioned, and thus the conditioning process was 
restricted largely to basic demographic variables (e.g., region, 
race/ethnicity, and sex). In 1985, programming capacity was expanded. In 
order to assure the comparability of all assessment results, all reading 
proficiency results for 1971 through 1984 were reconditioned using an extended 
model, which included more conditioning variables. The extended conditioning 
model was also used in all analyses of 1986 data and some analyses of 1988 
data (see Footnote 2 and Chapter 5). The largest effect of extending the
conditioning model was a 1.6 point increase for the 1975 sample of
17-year-olds.

The fourth column of Table 4.1 contains the reading trend estimates that
were reported in The NAEP 1985-86 Reading Anomaly: A Technical Report (Beaton,
1988a). This column contains the estimated average reading performances (and
their estimated standard errors in parentheses) for students at each age
level. Since the reconditioned estimates discussed in the previous paragraph
were already available, they were used in that report. No estimate of 1988
proficiency was available at the time the technical report was published. It
was primarily these trend estimates that signalled the anomaly, resulting in
the further investigations.

The next three columns in the table present changes in the reading trend
estimates that occurred between the publication of The NAEP 1985-86 Reading
Anomaly: A Technical Report (Beaton, 1988a) and this report.

The column labeled "Context Adjustment" contains the estimated effects
of the changes in measurement instrumentation and in administrative procedure
on the 1986 estimates, which were discussed in the previous two sections.
This adjustment was based on the common-population equating of the
bridge-to-1986 samples to the bridge-to-1984 samples. Using common-population
equating instead of common-item equating allowed for the possibility that the
items common to the 1984 and 1986 assessment forms were functioning
differently, and that form and administration changes may have had a different
impact at each age level. In the equating process, a linear function was
determined to match the first two moments of the estimated reading proficiency
distributions of the bridge-to-1986 samples to those of the bridge-to-1984
samples at each age level. These equating functions implied a set of
transformations for the 1986 item parameters. Using these transformed item
parameters, the 1986 data were adjusted to derive the modified 1986 results.
The adjustment procedure and its effects are described in detail in Chapter 6.
It is noteworthy that the effects on average reading performance vary by age
level, with a trivial negative effect (-0.3 points) at age 9, a larger
negative effect (-4.4 points) at age 13, and a very large positive effect
(+8.6 points) at age 17.
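
The idea of a linear function that matches the first two moments of two
distributions can be illustrated with a minimal sketch. The sketch is not the
procedure of Chapter 6: it works on unweighted scores rather than on item
parameters, and the proficiency values, the target mean, and the target
standard deviation are hypothetical numbers chosen only for illustration.

```python
import statistics


def linear_equating(source_scores, target_mean, target_sd):
    """Return (slope, intercept) of the linear map that gives the source
    scores the target mean and standard deviation (first two moments)."""
    source_mean = statistics.fmean(source_scores)
    source_sd = statistics.pstdev(source_scores)
    slope = target_sd / source_sd
    intercept = target_mean - slope * source_mean
    return slope, intercept


# Hypothetical proficiency values standing in for a bridge-to-1986 sample.
bridge_1986 = [240.0, 250.0, 255.0, 262.0, 273.0]

# Hypothetical moments standing in for the bridge-to-1984 sample.
target_mean, target_sd = 257.5, 10.0

slope, intercept = linear_equating(bridge_1986, target_mean, target_sd)
equated = [slope * score + intercept for score in bridge_1986]

# After the transformation, the two moments agree with the target.
print(round(statistics.fmean(equated), 6), round(statistics.pstdev(equated), 6))
```

Because the slope and intercept are estimated separately for each age level,
a sketch of this kind allows the adjustment to differ in size and sign across
ages, which is the property the common-population approach relies on.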

The sixth column, labeled "Weights Adjustment," shows the changes
resulting from a reanalysis of the 1984 weights. Historically, NAEP has
defined the ages of 9- and 13-year-olds on an October-to-September basis and
the age of 17-year-olds on a calendar-year basis, and it is necessary to
continue these definitions to maintain trends. For the computation of the
1984 sampling weights, a common algorithm for poststratification was used
across the three age levels. Upon review of the weighting procedures, it was
noted that the 1984 estimate of the percentage of 9-year-old students in
fourth grade and 13-year-old students in eighth grade could be improved and
made consistent with other assessment years by applying a different algorithm,
and so, in 1989, this algorithm was applied to the 1984 data at ages 9 and 13.
This modification led to a decrease of 1.9 points at age 9 and 0.9 points at
age 13. The details of sampling and weighting procedures are described in
Appendix B. The details of the adjustment shown in this column are given in
Appendix C. Except for the previous columns of this table and Figure 1.1, all
results in this report are based on the modified weights. This modification
affects only the two younger age levels in 1984.

The seventh column, labeled "Change in Conditioning Model," shows the
effect of a second change in the conditioning model at ages 9 and 13. In
1986, the trend and cross-sectional data were conditioned together at all age
levels. Differences in age definition between the trend and cross-sectional
samples changed modal grades for 9- and 13-year-olds and resulted in a
less-than-optimum estimate of the effect of a student being above, at, or
below the usual grade for his or her age. To improve the estimates, the 1986
trend data were conditioned separately in a 1989 analysis. The details of this
modification are described in Chapter 6. The change in the conditioning model
affected only ages 9 and 13 in 1986, since the age of the 17-year-old
population had been defined in the same way for both the trend and
cross-sectional samples. The larger effect was a 1.6 point increase in the
estimated average of the 9-year-olds.
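
The adjustment columns described above can be checked against the estimates
in Table 4.1 with simple arithmetic. The sketch below uses the age 13 and
age 17 values transcribed from the table; the function name and the row
layout are illustrative choices, not part of the original analysis.

```python
# Rows transcribed from Table 4.1 (ages 13 and 17):
# (age, year, technical report estimate, adjustments, modified estimate)
rows = [
    (13, 1984, 258.0, [-0.9], 257.1),        # weights adjustment
    (13, 1986, 260.4, [-4.4, -1.0], 255.0),  # context + conditioning model
    (17, 1986, 277.4, [+8.6], 286.0),        # context adjustment
]


def apply_adjustments(estimate, adjustments):
    """Add the listed adjustments to an estimate, rounded to one decimal
    place as in the published table."""
    return round(estimate + sum(adjustments), 1)


for age, year, estimate, adjustments, modified in rows:
    result = apply_adjustments(estimate, adjustments)
    print(age, year, result, result == modified)
```

In each row the technical report estimate plus the listed adjustments
reproduces the value in the final column of the table.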

The final column contains the revised estimates of the average
proficiencies (and their standard errors), as depicted graphically in
Figure 1.2. These estimates incorporate the four changes described above.^
As mentioned in Chapter 1, the revised estimates indicate a slight decline in
reading proficiency at each age level in 1986 and a rebound in 1988. However,
at each age level, the estimated 1986 decline from 1984 is between two and
three points on the NAEP reading scale and is not statistically significant.
Also at each age level, the estimated 1988 level is essentially the same as
the 1984 level, with the largest difference being a (nonsignificant) 1.3 point
gain at age 17.^

^Appendix A (p. 171) gives a summary of which modifications of reading
scale results are used in the tables and figures in this report.

The effect of the changes in measurement systems seems to have explained
most, but not all, of the anomalous 1986 estimates of reading proficiency.
Although the slight dips in proficiency at each age level are not individually
statistically significant, the fact that all three ages show such similar
results leaves some concern that another unknown factor also slightly affected
the 1986 reading data.
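
A rough check of the statement that the individual dips are not significant
can be made from the modified estimates and standard errors in Table 4.1.
The sketch below treats the 1984 and 1986 samples as independent and uses a
normal approximation with a two-sided 5 percent criterion; both are
assumptions made for illustration, and the significance procedure actually
used in this report may differ.

```python
import math


def change_with_z(mean_a, se_a, mean_b, se_b):
    """Difference (b - a), its standard error assuming independent samples,
    and the ratio of the two."""
    diff = mean_b - mean_a
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se_diff, diff / se_diff


# Modified estimates (mean, standard error) from Table 4.1, ages 13 and 17.
modified = {
    13: {1984: (257.1, 0.7), 1986: (255.0, 1.6)},
    17: {1984: (288.8, 0.9), 1986: (286.0, 1.7)},
}

for age, estimates in modified.items():
    diff, se_diff, z = change_with_z(*estimates[1984], *estimates[1986])
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"age {age}: change {diff:+.1f}, s.e. {se_diff:.2f}, z {z:.2f}, {verdict}")
```

Under these assumptions the declines at ages 13 and 17 are each smaller than
about one and a half standard errors of the difference.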

Mathematics and Science Results 

As mentioned in the last chapter, the redesign of the 1988 assessment 
resulted in the additional collection of some data on student performance in 
mathematics and science. The data were analyzed in the hope that they might 
shed some light on the 1986 reading anomaly. In this section, we will explore 
the implications. 

Before proceeding, several important differences between the reading
assessment and the mathematics and science assessments in 1988 should be
noted. In 1988, the measurement of mathematics and science was done using the
same assessment booklets and procedures as were used in the 1986 assessment.
As previously noted, the 9- and 13-year-old students were assessed in
mathematics and science using a tape recorder, as in 1986 and in all past
assessments, but the 17-year-old students were assessed by BIB spiraling
(without a tape recorder), as in 1986. NAEP included only one set of samples
for which mathematics and science were assessed; therefore, the effect of
different forms could not be investigated. What could be investigated was
whether or not the student populations appeared to improve or decline in
estimated performance in mathematics and science between 1986 and 1988.

^See The Reading Report Card, 1971 to 1988: Trends from the Nation's Report
Card (Mullis & Jenkins, 1990) for a detailed discussion of reading proficiency
between 1984 and 1988.

Before discussing these results, it should be noted that the NAEP scales
in different subject areas are not directly comparable. The reading,
mathematics, and science scales are arbitrarily set, and neither a scale point
on one scale nor the differences between two scale points on that scale should
be compared to those of another scale. The fact that the mathematics average
is always higher than the science average at age 17 does not imply that the
average 17-year-old student knows more mathematics than science; the scales
are not comparable. We do believe, however, that the directions (and perhaps
magnitudes) of change in scale performance are somewhat comparable, and we
wished to see if changes in performance in mathematics and science between
1986 and 1988 were in the same direction as the estimated increase in reading
performance. The magnitude of the changes could be expected to be different
on the different scales.

Estimates of the trends in performance for mathematics and science are
shown respectively in Figures 4.5 and 4.6. To show the amount of change
between 1986 and 1988 in the context of the average performances that have
been estimated in past assessments, the trend lines include performance
estimates from all previous assessments in each subject area. The years for

Figure 4.5

Trend of Proficiency Scale Means for Mathematics
1973 - 1988*

* 1973 results were interpolated for this plot. Bands extend from two standard
errors below to two standard errors above the mean.

Mathematics Scale Means and Standard Errors

Year    Age 9 (S.E.)    Age 13 (S.E.)    Age 17 (S.E.)

1973    219.1*          266.0*           30[illegible]
1978    218.6 (0.8)     264.1 (1.1)      300.4 (0.9)
1982    219.0 (1.1)     263.6 (1.1)      298.5 (0.9)
1986    221.7 (1.0)     269.0 (1.2)      302.0 (0.9)
1988    229.0 (1.1)     273.3 (0.8)      305.[illegible] (1.2)

Figure 4.6

Trend of Proficiency Scale Means for Science
1969 - 1988*

* 1969, 1970, and 1973 results were interpolated for this plot. Bands extend from
two standard errors below to two standard errors above the mean.

Science Scale Means and Standard Errors

Year    Age 9 (S.E.)    Age 13 (S.E.)    Age 17 (S.E.)

1969                                     304.8*
1970    224.9*          254.9*
1973    220.3*          249.5*           295.8*
1977    219.9 (1.2)     247.4 (1.1)      289.6 (1.0)
1982    220.9 (1.8)     250.2 (1.3)      283.3 (1.1)
1986    224.3 (1.2)     251.4 (1.4)      288.5 (1.4)
1988    228.9 (1.3)     257.3 (0.9)      294.2 (1.5)



which estimates are available are shown on the abscissa; these years differ 
for the two subject areas because they were not usually assessed together 
until 1986 and 1988. The figures show the estimated average performance on 
the mathematics and science scales for the different age levels. The details 
of the analyses that led to these estimates are described in Chapter 7. 

What is at first most striking in these graphs is that estimated changes
in average performance in both mathematics and science are similar to the
changes in average reading performance between 1986 and 1988. In reading, the
modified trend lines, shown in Figure 1.2, show a slight increase in average
reading performance from 1986 to 1988 at all age levels; in both mathematics
and science, a similar slight rise also occurs at all age levels. Thus, the
estimates are consistent for the trend lines in all subject areas.
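
The directional comparison can be tabulated from the scale means that
accompany Table 4.1 and Figures 4.5 and 4.6. The sketch below uses only the
1986 and 1988 values that are legible in this copy (reading at ages 13 and
17, mathematics at ages 9 and 13, science at all three ages); the dictionary
layout is an illustrative choice. Because the scales are not comparable
across subjects, only the sign of each change is examined.

```python
# (1986 mean, 1988 mean) by subject and age, as transcribed from
# Table 4.1 (modified estimates) and Figures 4.5 and 4.6.
estimates = {
    ("reading", 13): (255.0, 257.5),
    ("reading", 17): (286.0, 290.1),
    ("mathematics", 9): (221.7, 229.0),
    ("mathematics", 13): (269.0, 273.3),
    ("science", 9): (224.3, 228.9),
    ("science", 13): (251.4, 257.3),
    ("science", 17): (288.5, 294.2),
}

# Direction of change only: +1 for a rise, -1 for a decline, 0 for no change.
directions = {
    key: (after > before) - (after < before)
    for key, (before, after) in estimates.items()
}

for (subject, age), direction in directions.items():
    print(f"{subject:12s} age {age:2d}: {'rise' if direction > 0 else 'no rise'}")

print("all changes in the same direction:", len(set(directions.values())) == 1)
```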

The fact that the nine (three subject areas by three age levels) 
estimated changes between 1986 and 1988 are consistent is not truly 
surprising, however, since the nine estimated changes are not independent. At 
each age level, the changes in reading, mathematics, and science are based on 
one sample of students that was assessed in all three subject areas in 1986 
and another sample that was similarly assessed in 1988.^

The evidence from the mathematics and science data, therefore, did not
suggest any other reasons for the anomalous reading data. But, for the
reasons cited above, and since analysis of these data might have suggested
other hypotheses, they are included in this report.

^At ages 9 and 13 in both 1986 and 1988, exactly the same students were
used for measuring trend in all subject areas. At age 17 in 1986, the student
samples were partially overlapping, with individual students assessed in one,
two, or three subject areas. The 1988 samples of 17-year-olds were also partially
overlapping, using six BIB-spiraled booklets: one booklet contained three reading
blocks, two booklets contained one reading block and two mathematics blocks, two
booklets contained one reading block and two science blocks, and the final
booklet contained one reading, one mathematics, and one science block.



Conclusions and Continuing Research

These investigations into the reasons for the apparently anomalous
reading data in 1986 have resulted in modified estimates of reading
performance. The reading trend lines do not now seem anomalous, and the 1986
estimates, now slightly lower than the 1984 estimates at each age level, are
within the boundaries that could be expected from the random sampling process
if there had been no reading proficiency changes in the student populations.
Although the results are now reasonable, they are not conclusive. More
research should be done to assure that no other major factors affect the
accuracy of assessment results.

The major contributor to the unusual 1986 results was the effect of 
changes in the measurement system, which in this case included changes in 
assessment context and administrative procedures. The present research shows 
that these changes had a substantial and unpredictable effect on reading 
proficiency estimates. The 1988 assessment included randomly equivalent 
samples in which the different measurement systems were used, and so equal 
population equating instead of equal item-parameter equating could be used to 
equate the two measurement systems. The equal population equating resulted in 
the trend line modifications that make the reading trend seem reasonable. 

The 1990 assessment will collect additional data that may shed more
light on the 1986 reading results. As in 1988, the 1990 assessment will
contain two randomly equivalent samples, one of which will be measured using
the 1984 measurement system and the other using the 1986 measurement system.
The analyses of the 1988 equivalent samples that resulted in the modified
trend estimates, which were reported above, can be repeated using the new 1990
data. The stability of the equal population equating process can thus be
estimated. The stability of item parameters within a particular measurement
system can also be further investigated.







Chapter 5

ANALYSES OF 1988 READING BRIDGE DATA 
Rebecca Zwick^ 



Overview 

As described in Chapter 3, a bridge to 1984 and a bridge to 1986 were
included in the 1988 assessment, each incorporating test booklets and
administration procedures that replicated as closely as possible those of the
corresponding assessment year. The analysis of data from these bridges was
expected to shed further light on the causes of the anomalous reading results
in 1986. Three types of comparisons are discussed in this chapter:
(1) comparison of the 1988 bridge to 1984 with the 1988 bridge to 1986,
(2) comparison of the 1984 assessment with the 1988 bridge to 1984, and
(3) comparison of the 1986 assessment with the 1988 bridge to 1986. The first
of these comparisons has the most direct bearing on the anomaly; the second
two comparisons yield estimates of changes in reading proficiency that are
unconfounded with changes in the assessment instruments and conditions.
Comparisons are given both in terms of item percents correct and in terms of
reading scale values. Table 5.1 lists the samples on which this chapter is
based, along with the number of students, number of reading blocks, and time
of testing. Sampling procedures for the bridges are described in Appendix B.

^David Freund provided statistical programming, with assistance from Minhwei
Wang and Kate Pashley. David Freund and Jo-Ling Lirug produced the figures in
this chapter. Robert Mislevy provided consultation on scaling.



Table 5.1

NAEP Samples Used in 1988
Investigation of 1986 Reading Anomaly*

                                           Number of          Number of
                                           Students           Reading
          Sample                           (Scaled Results)   Blocks**    Time of Testing

1984
Age 9     Age subsample of spiral          16,799             12          Jan. 2 - March 19, 1984
Age 13    Age subsample of spiral          17,535             12          Oct. 10 - Dec. 17, 1983
Age 17    Age subsample of spiral          18,984             12          March 12 - May 11, 1984

1986
Age 9     Bridge to 1984                    6,932              3          Jan. 6 - Jan. 31, 1986
Age 13    Bridge to 1984                    6,200              3          Nov. 4 - Dec. 13, 1985
Age 17    Age subsample of spiral          16,418              6          Feb. 17 - May 2, 1986

1988
Age 9     Age subsample-bridge to 1984      3,782             10          Jan. 4 - Mar. 11, 1988
          Bridge to 1986                    3,711              3          Jan. 4 - Mar. 11, 1988

Age 13    Age subsample-bridge to 1984      4,005             10          Oct. 12 - Dec. 18, 1987
          Bridge to 1986                    3,942              3          Oct. 12 - Dec. 18, 1987

Age 17    Age subsample-bridge to 1984      3,652             10          March 14 - May 13, 1988
          Age subsample-bridge to 1986      3,715              6          March 14 - May 13, 1988

*Age definitions for these samples are consistent with 1984 definitions.
**The number of blocks that include at least one reading scale item.



Some of the main differences between instruments and procedures for the
1984 and 1986 assessments were these:



• Reading was accompanied by writing in 1984 and by mathematics and
science (and, at age 17, by computer science, history, and
literature) in the 1986 trend samples.



• The composition of reading item blocks was not the same in 1984
and 1986. Therefore, items that appeared in both years did not
necessarily appear in the same order or context, nor was the time
allowed per item assured to be the same.



• In 1984, students responded to items by circling the letter of the
correct response, whereas in 1986, students responded by filling
in an oval.



Further detail on the differences between the two assessments appears in 
Chapters 2 and 3. 

Like the 1984 assessment, the 1988 bridge to 1984 included reading and
writing blocks. At each age, this bridge consisted of six of the 1984
booklets that contained at least one scaled reading block. (The 1984 balanced
incomplete block [BIB] assessment included 57 such booklets at age 9 and 56
such booklets at ages 13 and 17.) The six bridge booklets included 10 of the
12 reading blocks scaled in 1984.

Like the 1986 assessment, the bridge to 1986 included reading,
mathematics, and science blocks. At ages 9 and 13, this bridge contained all
three booklets and all three reading blocks that were used for the estimation
of trend in 1986. At age 17, the bridge to 1986 included only six of the 35
booklets from 1986 that had at least one reading block, but these bridge
booklets contained all six 1986 reading blocks.2

Tables and Figures Used in the Three Bridge Comparisons

The three bridge comparisons described in the following section are 
based on data displayed in Tables 5.2 to 5.9 and Figures 5.1 to 5.4. 

Table 5.2 gives the mean percents correct for the two 1988 bridge
samples and for the 1984 and 1986 assessments. These means are based on all
multiple-choice items that were included in both bridge samples.^ Standard
errors obtained through jackknifing (see E. G. Johnson, Burke, Braden, Hansen,
Lago, & Tepping, 1988) are given in parentheses.
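
For readers unfamiliar with jackknifing, the general idea can be sketched as
follows. The sketch assumes a replicate-based form of the jackknife in which
the statistic is recomputed once for each replicate set of weights and the
variance is taken as the sum of squared deviations of the replicate estimates
from the full-sample estimate. The exact procedure is the one documented by
Johnson et al. (1988) and is not reproduced here; the function name and the
numbers below are hypothetical.

```python
import math


def jackknife_se(full_estimate, replicate_estimates):
    """Standard error as the square root of the summed squared deviations
    of the replicate estimates from the full-sample estimate (assumed form)."""
    variance = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(variance)


# Hypothetical full-sample percent correct and three replicate estimates.
full_sample = 62.0
replicates = [62.3, 61.6, 62.0]

print(round(jackknife_se(full_sample, replicates), 3))
```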



2 The bridge to 1984 included booklets 16, 17, 27, 34, 55, and 60 at age 9
and booklets 13, 16, 17, 21, 34, and 57 at ages 13 and 17 (see J. R. Johnson,
1987, pp. 120-121). The bridges to 1986 included booklets 1-3 at ages 9 and 13
and booklets 14, 36, 47, 62, 68, and 81 at age 17 (see Beaton, 1988a, pp. 421-
423). At ages 9 and 13, each booklet in the bridge to 1986 included one block
each of reading, math, and science. At age 17, two booklets contained two blocks
of math and one block of reading, two booklets contained two blocks of science
and one block of reading, one booklet contained one block each of reading, math,
and science, and one booklet contained three reading blocks. Although some 1986
age 17 booklets combined reading with computer competence or with history and
literature, these booklets were not included in the 1988 bridge to 1986.

^Percent correct is defined as R/(R + W + O + DK), where R, W, O, and DK
represent the sum of the student weights for those who got the item right, those
who got the item wrong, those who reached the item but omitted it, and those who
indicated that they did not know the answer, respectively. Students who did
not reach the item are not included in the computation. Note that the percents
correct that appear in portions of The NAEP 1985-86 Reading Anomaly: A Technical
Report (Beaton, 1988b) were computed using NAEP's earlier definition of the
proportion correct, R/(R + W + DK). The change in definitions has very little
impact on the reported results. Also, note that, in the 1988 report, a larger
set of items was analyzed and that the 1984 results for ages 9 and 13 were based
on sampling weights that have now been modified (Appendix C).
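
The two definitions in this footnote can be compared with a short sketch. The
function name, the `include_omits` flag, and the weight sums are hypothetical
and serve only to show how the denominator differs between the current
definition, R/(R + W + O + DK), and the earlier one, R/(R + W + DK).

```python
def percent_correct(r, w, o, dk, include_omits=True):
    """Weighted percent correct for one item.

    r, w, o, dk are sums of student weights for right, wrong, omitted
    (reached but skipped), and "don't know" responses. Students who did
    not reach the item are excluded before these sums are formed.
    """
    denominator = r + w + dk + (o if include_omits else 0.0)
    return 100.0 * r / denominator


# Hypothetical weight sums for a single item.
r, w, o, dk = 620.0, 300.0, 30.0, 50.0

current = percent_correct(r, w, o, dk)                       # R/(R + W + O + DK)
earlier = percent_correct(r, w, o, dk, include_omits=False)  # R/(R + W + DK)

print(round(current, 1), round(earlier, 1))
```

Because omitted responses enter only the denominator, the current definition
can never give a higher value than the earlier one, and the two agree when no
student omits the item.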




Table 5.2 

Mean Percents Correct with Standard Errors for 1988 Bridges to 1984 
and 1986 and for 1984 and 1986 Assessments* 



                1984           1988 Bridge    1986           1988 Bridge
                Assessment     to 1984        Assessment     to 1986

Age 9           60.0 (0.6)     62.2 (1.0)     59.3 (0.9)     62.1 (0.8)
(26 items)

Age 13          63.1 (0.4)     63.8 (0.7)     64.4 (0.7)     65.9 (0.4)
(19 items)

Age 17          75.9 (0.4)     76.6 (0.5)     73.5 (0.6)     73.8 (0.6)
(23 items)



*All multiple-choice items that were common to both bridges were used in
this analysis.






Tables 5.3 - 5.8 give mean percents correct for NAEP's major reporting
groups on these same items for the two bridge samples (Tables 5.3, 5.5, and
5.7) and for the 1984 and 1986 assessments (Tables 5.4, 5.6, and 5.8). The
column labeled "N" gives the average number of students responding to each
item. In addition, these tables include, for each sample, the average across
items of the percent of students who did not reach the item. Differences
between the two bridge samples and between 1986 and 1984 in percents correct
and percents not reached are also given.

Table 5.9 gives means and standard deviations of reading scale values
for these same samples of students, using the metric of the 1984 reading
scale. Standard errors of means are given in parentheses. These results are
based on NAEP plausible values technology (see Mislevy, 1988), which is a
method of estimating proficiency distributions based on students' item
responses and background characteristics (referred to in this context as
conditioning variables). The analyses that produced the results in Table 5.9
included six conditioning variables: gender, ethnicity, size and type of
community (STOC), region, parents' education, and TV watching.^



^The coding for these conditioning variables was the same as that given in
Mislevy, 1988 (p. 198). The estimated coefficients of the conditioning variables
for the two bridge samples appear in Table D.1 in Appendix D. Note that the
results reported in Table 5.9 for the 1988 bridge to 1984 are not identical to
those reported in The Reading Report Card, 1971 to 1988 (Mullis & Jenkins, 1990)
and in Tables 4.1, 6.2, 6.3, and 6.4 and Figures 1.2 and 6.1. For purposes of
trend reporting, a more complete set of conditioning variables was used in order
to maximize comparability with the 1984 assessment. The results reported here
maximize the comparability between the two sets of 1988 bridge results. Appendix
A (p. 171) gives a summary of which modifications and adjustments of reading
scale results are used in the tables and figures in this report.



Table 5.3

NAEP 1988 Reading Bridges: Age 9
Weighted Mean Percents Correct and Percents Not Reached
for 26 Multiple-choice Items Common Between Bridges*

                        BRIDGE TO 1984             BRIDGE TO 1986          DIFFERENCE 1986 - 1984
SUBGROUP                N  % CORRECT  % NOT RCH    N  % CORRECT  % NOT RCH    % CORRECT  % NOT RCH

-- TOTAL --           598  62.2 (1.0)    5.5    1135  62.1 (0.8)    8.4    0.0 (1.2)  [illegible]

SEX
  MALE                295  60.1 (1.3)    5.7     559  60.5 (0.9)    9.2    0.4 (1.6)    3.5
  FEMALE              303  64.2 (0.9)    5.5     575  63.6 (1.1)    7.6   -0.6 (1.4)    2.1

ETHNICITY
  WHITE               361  65.5 (1.1)    4.2     689  66.1 (0.9)    7.4    0.6 (1.4)    3.2
  BLACK               104  51.5 (2.2)   10.6     175  50.4 (1.4)   11.2   -1.1 (2.6)    0.6
  HISPANIC            108  51.4 (2.3)    8.0     216  48.9 (1.9)   11.5   -2.5 (3.0)    3.5
  OTHER                26  66.6 (2.4)    4.9      54  65.1 (2.7)    7.6   -1.6 (3.6)    2.7

REGION
  NORTHEAST           149  63.7 (1.4)    7.9     295  63.6 (1.5)    7.3   -0.1 (2.0)   -0.6
  SOUTHEAST           159  59.8 (2.4)    6.0     316  59.9 (1.8)    9.4    0.3 (3.0)    3.4
  CENTRAL             128  64.9 (1.6)    4.3     245  63.2 (1.5)    8.1   -1.? (2.2)    3.8
  WEST                162  60.7 (2.0)    4.3     278  62.0 (1.5)    8.6    1.3 (2.5)    4.3

PARENTAL EDUCATION
  LESS THAN H.S.       25  53.4 (3.1)    7.7      43  49.5 (1.8)    6.6   -3.8 (3.6)   -1.2
  GRADUATED H.S.       90  61.3 (1.5)    6.3     163  60.9 (1.5)    8.6   -0.3 (2.1)    2.3
  POST H.S.            30  63.9 (3.5)    5.0      87  67.2 (2.?)    7.6    3.3 (4.1)    2.6
  GRADUATED COLLEGE   245  67.9 (1.2)    4.5     490  68.0 (0.9)    6.3    0.0 (1.5)    1.9
  UNKNOWN             207  56.8 (1.5)    5.9     342  54.6 (1.1)   11.3   -2.3 (1.8)    5.4

*Standard errors are given in parentheses. The "N" column gives the average number of students responding to
each item. Because of rounding, the N's for subgroups may not sum to the N for the total group.



Table 5.4

NAEP 1984 and 1986 Reading Assessments: Age 9
Weighted Mean Percents Correct and Percents Not Reached
for 26 Multiple-choice Items Common Between Bridges*

                        1984 ASSESSMENT            1986 ASSESSMENT         DIFFERENCE 1986 - 1984
SUBGROUP                N  % CORRECT  % NOT RCH    N  % CORRECT  % NOT RCH    % CORRECT  % NOT RCH

-- TOTAL --          1972  60.0 (0.6)    6.5    2102  59.3 (0.9)    9.3   -0.8 (1.1)    2.8

SEX
  MALE                998  57.3 (0.6)    6.7    1049  56.6 (0.9)   10.1   -0.7 (1.1)    3.4
  FEMALE              974  62.8 (0.7)    6.2    1053  61.8 (1.0)    8.6   -1.0 (1.2)    2.3

ETHNICITY
  WHITE              1338  64.0 (0.7)    5.6    1386  63.4 (0.9)    8.6   -0.6 (1.2)    3.0
  BLACK               281  46.9 (1.1)    9.8     249  46.5 (1.0)   12.0   -0.4 (1.5)    2.2
  HISPANIC            264  49.3 (0.9)    8.0     307  46.5 (1.6)   10.6   -2.8 (1.8)    2.6
  OTHER                89  61.5 (1.8)    5.9     160  58.5 (2.7)   10.2   -2.9 (3.?)    4.3

REGION
  NORTHEAST           446  62.2 (1.6)    6.1     524  61.5 (2.2)    8.6   -0.7 (2.7)    2.5
  SOUTHEAST           489  57.7 (1.1)    7.2     476  55.7 (1.8)    9.5   -2.0 (2.1)    2.3
  CENTRAL             567  62.5 (1.5)    6.3     537  61.5 (1.8)    9.2   -1.1 (2.3)    2.9
  WEST                470  57.8 (0.7)    6.4     585  58.0 (1.8)    9.9    0.2 (2.0)    3.6

PARENTAL EDUCATION
  LESS THAN H.S.      115  49.6 (1.9)    8.1      88  48.0 (1.6)   11.1   -1.7 (2.5)    3.0
  GRADUATED H.S.      372  59.2 (0.8)    6.3     321  55.2 (1.0)    9.9   -4.0 (1.3)    3.6
  POST H.S.            99  60.2 (1.9)    5.4     146  66.1 (1.2)    6.4    6.0 (2.3)    1.0
  GRADUATED COLLEGE   638  68.0 (0.8)    4.3     822  66.3 (0.9)    7.5   -1.7 (1.2)    3.3
  UNKNOWN             72?  55.9 (0.8)    7.6     721  53.1 (1.1)   11.3   -2.8 (1.4)    3.7

*Standard errors are given in parentheses. The "N" column gives the average number of students responding to
each item. Because of rounding, the N's for subgroups may not sum to the N for the total group.



Table 5.5

NAEP 1988 Reading Bridges: Age 13
Weighted Mean Percents Correct and Percents Not Reached
for 19 Multiple-choice Items Common Between Bridges*

                            BRIDGE TO 1984                 BRIDGE TO 1986          DIFFERENCE 1986 - 1984
SUBGROUP                 N  % CORRECT   % NOT RCH        N  % CORRECT   % NOT RCH     % CORRECT  % NOT RCH

TOTAL                  657  63.8 (0.7)        0.8     1280  65.9 (0.4)        4.8     2.1 (0.8)        4.0

SEX
  MALE                 320  61.5 (0.8)        1.1      636  64.3 (0.6)        5.7     2.8 (1.0)        4.7
  FEMALE               336  66.0 (0.8)        0.6      644  67.5 (0.6)        3.9     1.5 (1.0)        3.3

ETHNICITY
  WHITE                478  65.8 (0.8)        0.2      904  68.7 (0.4)        3.1     2.9 (0.9)        2.9
  BLACK                 92  58.9 (1.5)        2.9      196  59.2 (0.7)       10.3     0.3 (1.6)        7.4
  HISPANIC              58  55.6 (2.1)        1.7      125  55.0 (1.8)        9.2    -0.6 (2.7)        7.4
  OTHER                 29  66.8 (2.9)        1.6       54  67.3 (2.0)        4.0     0.5 (3.6)        2.5

REGION
  NORTHEAST            144  64.9 (1.8)        0.1      287  68.0 (0.9)        5.3     3.1 (2.0)        5.2
  SOUTHEAST            139  63.8 (1.3)        1.2      256  65.8 (1.3)        7.2     1.9 (1.8)        6.0
  CENTRAL              196  62.6 (1.5)        1.4      368  65.3 (1.2)        3.2     2.7 (1.9)        1.9
  WEST                 177  63.9 (1.4)        0.6      369  64.9 (0.7)        3.7     1.0 (1.6)        3.1

PARENTAL EDUCATION
  LESS THAN H.S.        45  57.9 (1.8)        0.8       88  58.9 (2.2)        9.1     1.0 (2.9)        8.3
  GRADUATED H.S.       213  61.9 (1.0)        0.7      321  61.8 (0.6)        5.5    -0.2 (1.1)        4.7
  POST H.S.             68  66.3 (1.8)        0.4      208  67.8 (1.0)        2.5     1.5 (2.0)        2.2
  GRADUATED COLLEGE    271  67.7 (0.9)        0.5      559  69.9 (0.5)        3.2     2.2 (1.0)        2.8
  UNKNOWN               58  55.0 (1.9)        1.9      102  59.9 (1.8)       10.8     4.9 (2.6)        9.0

*Standard errors are given in parentheses. The "N" column gives the average number of students responding to
each item. Because of rounding, the N's for subgroups may not sum to the N for the total group.
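
The standard errors in the DIFFERENCE columns are consistent with treating the two bridge samples as independent, so that the variance of a difference is the sum of the two variances. The sketch below checks this for the TOTAL row of Table 5.5; the independence formula is an assumption made here for illustration, since the report's own computation is not described at this point, and the function name is hypothetical.

```python
import math


def se_of_difference(se_a: float, se_b: float) -> float:
    """Standard error of a difference between two independent estimates."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


# TOTAL row, Table 5.5: bridge to 1984 (S.E. 0.7) and bridge to 1986 (S.E. 0.4).
total_se = se_of_difference(0.7, 0.4)
print(round(total_se, 1))  # 0.8, the tabled standard error of the 2.1-point difference
```

Applied row by row, the same calculation gives values close to the tabled standard errors, allowing for rounding of the inputs.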






Table 5.6

NAEP 1984 and 1986 Reading Assessments: Age 13
Weighted Mean Percents Correct and Percents Not Reached
for 19 Multiple-choice Items Common Between Bridges*

                           1984 ASSESSMENT                1986 ASSESSMENT          DIFFERENCE 1986 - 1984
SUBGROUP                 N  % CORRECT   % NOT RCH        N  % CORRECT   % NOT RCH     % CORRECT  % NOT RCH

TOTAL                 2208  63.1 (0.4)        2.4     1911  64.4 (0.7)        5.9     1.2 (0.8)        3.5

SEX
  MALE                1108  61.3 (0.6)        3.0      939  63.5 (0.7)        7.6     2.1 (0.9)        4.6
  FEMALE              1100  65.1 (0.6)        1.8      972  65.3 (0.9)        4.2     0.2 (1.0)        2.4

ETHNICITY
  WHITE               1599  65.8 (0.5)        1.7     1175  66.9 (0.8)        4.0     1.1 (1.0)        2.3
  BLACK                294  53.5 (1.?)        5.4      418  58.7 (1.3)       11.6     5.2 (1.7)        6.2
  HISPANIC             241  54.8 (1.5)        4.5      254  52.8 (2.0)       12.9    -2.0 (2.5)        8.4
  OTHER                 74  62.7 (2.4)        2.2       63  64.4 (2.8)        2.5     1.7 (3.6)        0.3

REGION
  NORTHEAST            493  64.6 (0.?)        2.3      475  66.5 (1.4)        3.9     1.9 (1.6)        1.6
  SOUTHEAST            548  63.0 (1.2)        2.8      427  64.1 (1.1)        6.0     1.0 (1.6)        5.2
  CENTRAL              638  62.7 (1.1)        2.3      450  62.8 (2.2)        6.0     0.2 (2.5)        3.7
  WEST                 529  62.5 (0.6)        2.3      559  64.3 (1.3)        5.6     1.8 (1.4)        3.3

PARENTAL EDUCATION
  LESS THAN H.S.       193  54.5 (1.1)        3.8      152  57.8 (1.6)        9.9     3.2 (2.0)        6.1
  GRADUATED H.S.       779  61.1 (0.7)        2.6      545  62.3 (0.8)        6.4     1.2 (1.1)        3.8
  POST H.S.            220  67.8 (0.7)        2.0      296  67.2 (1.0)        3.6    -0.6 (1.3)        1.7
  GRADUATED COLLEGE    792  68.4 (0.6)        1.4        ?  68.5 (0.7)        4.4     0.1 (0.9)        2.9
  UNKNOWN              203  52.9 (0.9)        3.9      159  52.1 (2.3)       12.4    -0.8 (2.5)        8.5

*Standard errors are given in parentheses. The "N" column gives the average number of students responding to
each item. Because of rounding, the N's for subgroups may not sum to the N for the total group.



Table 5.7

NAEP 1988 Reading Bridges: Age 17
Weighted Mean Percents Correct and Percents Not Reached
for 23 Multiple-choice Items Common Between Bridges*

                            BRIDGE TO 1984                 BRIDGE TO 1986          DIFFERENCE 1986 - 1984
SUBGROUP                 N  % CORRECT   % NOT RCH        N  % CORRECT   % NOT RCH     % CORRECT  % NOT RCH

TOTAL                  604  76.6 (0.5)        0.5      867  73.8 (0.6)        2.8    -2.8 (0.8)        2.2

SEX
  MALE                 277  74.? (0.8)        0.6      412  71.1 (1.1)        3.5    -3.2 (1.4)        2.9
  FEMALE               327  78.6 (0.7)        0.4      455  76.6 (0.7)        2.0    -2.0 (1.0)        1.6

ETHNICITY
  WHITE                425  78.8 (0.5)        0.3      639  76.2 (0.6)        1.8    -2.6 (0.8)        1.5
  BLACK                106  71.4 (1.4)        0.3      133  65.5 (1.9)        4.7    -6.0 (2.4)        4.3
  HISPANIC              48  66.0 (1.6)        1.4       66  65.5 (2.5)        4.8    -0.5 (2.9)        3.4
  OTHER                 26  78.2 (2.7)        2.3       29  76.6 (2.9)        7.7    -1.6 (3.9)        5.5

REGION
  NORTHEAST            131  79.1 (1.2)        0.5      203  75.5 (1.3)        2.6    -3.6 (1.8)        2.1
  SOUTHEAST            155  74.6 (1.1)        0.3      221  73.8 (1.6)        2.0    -0.9 (1.9)        1.7
  CENTRAL              106  77.3 (0.7)        0.1      156  73.7 (1.0)        2.7    -3.7 (1.2)        2.6
  WEST                 212  75.4 (1.1)        1.1      287  72.6 (1.1)        3.5    -2.7 (1.5)        2.4

PARENTAL EDUCATION
  LESS THAN H.S.        52  69.6 (1.7)        0.4       69  63.3 (1.9)        2.9    -6.3 (2.6)        2.5
  GRADUATED H.S.       175  74.3 (0.8)        0.3      197  70.0 (1.2)        2.4    -4.3 (1.4)        2.1
  POST H.S.            103  78.8 (1.3)        0.8      207  75.2 (1.1)        3.4    -3.7 (1.7)        2.6
  GRADUATED COLLEGE    257  79.9 (0.8)        0.3      371  79.3 (0.8)        1.5    -0.7 (1.1)        1.3
  UNKNOWN               16  61.4 (2.7)        2.6       21  58.0 (4.7)        4.3    -3.4 (5.4)        1.7

*Standard errors are given in parentheses. The "N" column gives the average number of students responding to
each item. Because of rounding, the N's for subgroups may not sum to the N for the total group.



Table 5.8

NAEP 1984 and 1986 Reading Assessments: Age 17
Weighted Mean Percents Correct and Percents Not Reached
for 23 Multiple-choice Items Common Between Bridges*

                           1984 ASSESSMENT                1986 ASSESSMENT          DIFFERENCE 1986 - 1984
SUBGROUP                 N  % CORRECT   % NOT RCH        N  % CORRECT   % NOT RCH     % CORRECT  % NOT RCH

TOTAL                 2390  75.9 (0.4)        1.8     1901  73.5 (0.6)        2.5    -2.4 (0.8)        0.7

SEX
  MALE                1191  73.8 (0.6)        2.4      954  70.9 (0.8)        2.8    -2.9 (1.0)        0.4
  FEMALE              1198  78.2 (0.4)        1.2      947  76.2 (0.8)        2.2    -2.0 (0.9)        1.0

ETHNICITY
  WHITE               1745  78.2 (0.5)        1.2     1341  76.2 (0.7)        1.6    -2.0 (0.8)        0.4
  BLACK                339  67.7 (0.9)        3.4      319  65.2 (0.9)        4.9    -2.5 (1.3)        1.5
  HISPANIC             225  68.8 (1.7)        4.1      192  62.4 (1.5)        6.4    -6.4 (2.2)        2.4
  OTHER                 80  73.7 (2.1)        3.4       48  66.4 (3.1)        4.4    -7.4 (3.7)        1.0

REGION
  NORTHEAST            536  77.1 (1.7)        1.1      383  75.7 (1.1)        2.3    -1.5 (2.0)        1.2
  SOUTHEAST            601  75.2 (0.7)        1.6      487  70.6 (0.9)        2.4    -4.5 (1.1)        0.8
  CENTRAL              680  75.8 (0.9)        1.8      492  74.6 (1.5)        1.7    -1.2 (1.7)       -0.1
  WEST                 573  75.6 (0.6)        2.8      539  72.6 (1.2)        3.7    -3.0 (1.3)        0.3

PARENTAL EDUCATION
  LESS THAN H.S.       287  69.5 (0.9)        2.7      164  64.0 (1.5)        4.7    -5.4 (1.7)        1.9
  GRADUATED H.S.       846  73.0 (0.6)        1.8      52?  69.7 (0.5)        2.4    -3.4 (0.8)        0.6
  POST H.S.            347  78.8 (0.5)        1.3      427  75.9 (0.9)        1.8    -3.0 (1.0)        0.5
  GRADUATED COLLEGE    809  80.9 (0.6)        1.3      703  78.7 (0.8)        1.9    -2.1 (1.1)        0.5
  UNKNOWN               77  60.4 (1.3)        6.1       69  56.1 (2.2)        5.7    -4.4 (2.6)       -0.4

*Standard errors are given in parentheses. The "N" column gives the average number of students responding to
each item. Because of rounding, the N's for subgroups may not sum to the N for the total group.






Table 5.9

Reading Scale Means and Standard Deviations for 1988 Bridges
to 1984 and 1986 and for 1984 and 1986 Assessments*

              1984 Assessment            1988 Bridge to 1984          1986 Assessment             1988 Bridge to 1986
                                N of                         N of                           N of                         N of
          Mean    SE      SD   Items    Mean    SE      SD   Items    Mean    SE        SD   Items    Mean    SE      SD   Items

Age 9    211.0  (1.0)   41.1    126    212.1  (1.1)   40.2     99    208.9  (1.2)     39.6     31    214.0  (1.0)   40.9     30
Age 13   257.1  (0.7)   35.5    124    257.5  (0.9)   33.9     99    259.4  (1.0)     35.7     25    263.7  (0.8)   37.1     24
Age 17   288.8  (0.9)   40.3    113    289.9  (1.3)   37.4     87    277.4  (1.1)**   49.4     62    281.9  (1.4)   46.9     58

* All results are based on NAEP plausible values technology. Standard errors of means are given in parentheses. Appendix A
(p. 171) gives a summary of which modifications and adjustments of reading scale results are used in the tables and figures in this
report.

** Standard error differs from column 4 of Table 4.1 because of a change in jackknife methodology.






All multiple-choice items that appeared in the bridges were included in
these analyses; for the bridge to 1984, the number of items per cohort ranged
from 87 to 99, substantially larger than in the percents correct analysis.^
The item parameters used in these bridge sample analyses, which are the same
as those used in the corresponding assessment years, are listed in the 1984
and 1986 technical reports (Beaton, 1987 and Beaton, 1988b) for the bridge to
1984 and the bridge to 1986, respectively. Chapter 6 of this report includes
a description of the procedures used originally to estimate the 1984 and 1986
item parameters.

Because of improvements in estimation procedures noted in Chapter 4,
some of the results in Table 5.9 for the 1984 and 1986 assessments differ from
those that appeared in The NAEP 1985-86 Reading Anomaly: A Technical Report
(Beaton, 1988a). The 1984 scale means for ages 9 and 13 are lower than the
previously reported results by roughly two points and one point, respectively,
because of the adjustments to the 1984 student weights (Appendix C). The 1986
results at ages 9 and 13 differ from those previously reported because of the
correction of a specification error in the conditioning procedures in 1986.^



^Professionally scored items were excluded from the bridge scales so that
investigation of the reading anomaly would not be complicated by changes in
scoring patterns for these items. The exclusion of the professionally scored
items from the bridge scaling accounts for the difference between 1986 and the
1988 bridge to 1986 in the number of items scaled. The number of items in the
bridge to 1984 is substantially less than the number of items in the 1984
assessment because only a subset of the 1984 booklets was used in this bridge.

^At ages 9 and 13, the 1986 reading assessment included both a balanced
incomplete block (BIB) spiral component and a bridge to 1984. For each of these
two cohorts, the BIB and bridge samples were combined for purposes of generating
plausible values. In estimating the conditional distributions, an indicator
variable for sample membership (BIB or bridge) was included among the
conditioning variables (see Mislevy, 1988). Also included was a variable that
reflected whether students were above, at, or below modal grade. However, the
modal grade was not the same for the BIB and bridge samples. The conditioning
model was mis-specified in that it did not allow for an interaction between this




Neither the weight adjustment for 1984 results, nor the correction to the 
estimation procedure in 1986 affected the results for age 17, where the 
anomaly was the most pronounced. 

Figure 5.1 displays the NAEP reading scale results since 1971.
Figures 5.2 - 5.4 show overlay graphs of the estimated distributions for
the two 1988 bridge samples at each age level. ^

1. Comparison of the 1988 Bridge Samples

The main findings of the comparison of the 1988 bridges to 1984 with the 
1988 bridges to 1986 are these: 

• At age 9, the mean percent correct for the two bridges
was the same; the reading scale mean for the bridge to
1986 was slightly higher.

• At age 13, the performance of the bridge to 1986 was
superior both in terms of mean percents correct and in
terms of scale means.

• At age 17, the performance of the bridge to 1986 was
inferior, both in terms of percents correct and in
terms of scale means.



variable and the sample membership variable. (Since the age 17 assessment
included only a BIB component, estimation of the age 17 results was not
affected.) The problem was corrected by conditioning the BIB and bridge samples
separately. The effect on the BIB results that are reported in Who Reads Best?
(Applebee, Langer, & Mullis, 1988) was almost nil; the effect on the bridges,
more substantial. The corrected mean for age 9 is ^.6 points higher than the
previously reported value, whereas the corrected mean for age 13 is one point
lower. At both ages, the standard deviations are about 5 points lower; unlike
the earlier estimates, they do not appear large relative to 1984.



^Figures 5.1 to 5.4 are identical to Figures 4.1 to 4.4. They are
reproduced here for convenience. These figures are based on all five sets of
plausible values.




Figure 5.1

Reading Scale Results
1971 - 1988*

[Graph not reproduced: reading scale means by assessment year (1971, 1975, 1980, 1984, 1986, 1988) for ages 9, 13, and 17, with the 1988 bridge to 1984 and the 1988 bridge to 1986 plotted separately at each age.]

* Standard errors of means are approximately 1.0. Bands extend from two standard
errors below to two standard errors above the mean. Appendix A (p. 171) gives a summary of
which modifications and adjustments of reading scale results are used in the tables and
figures in this report.

Weighted Reading Proficiency Means and Standard Errors

Year               Age 9 (S.E.)    Age 13 (S.E.)    Age 17 (S.E.)

1971               207.3 (1.0)     255.2 (0.9)      285.4 (1.2)
1975               210.2 (0.7)     256.0 (0.8)      286.1 (0.8)
1980               214.8 (1.1)     258.5 (0.9)      285.8 (1.4)
1984               211.0 (1.0)     257.1 (0.7)      288.8 (0.9)
1986               208.9 (1.2)     259.4 (1.0)      277.4 (1.1)**
1988 Br. to 84     212.1 (1.1)     257.5 (0.9)      289.9 (1.3)
1988 Br. to 86     214.0 (1.0)     263.7 (0.8)      281.9 (1.4)

** Standard error differs from column 4 of Table 4.1 because of a change in
jackknife methodology.






Figure 5.2

Estimated Reading Proficiency Distributions for 1988 Bridge Samples, Age 9

[Graph not reproduced. Solid line: bridge to 1984. Dashed line: bridge to 1986. Horizontal axis: Reading Proficiency Scale.]






Figure 5.3

Estimated Reading Proficiency Distributions for 1988 Bridge Samples, Age 13




Figure 5.4

Estimated Reading Proficiency Distributions for 1988 Bridge Samples, Age 17

[Graph not reproduced. Solid line: bridge to 1984. Dashed line: bridge to 1986. Vertical axis: Proportion of Population per 10-Point Interval. Horizontal axis: Reading Proficiency Scale.]






At all three ages, the variance of the scale distributions was greater
for the bridge to 1986 than for the bridge to 1984. The difference is
particularly notable at age 17. The large difference in variances at age 17
is similar to that observed in comparing the actual 1984 and 1986 assessments.
The graphs of distributions in Figures 5.2 through 5.4 show that the bridge to
1986 is characterized by a heavier upper tail than the bridge to 1984 at age
13 and heavier upper and lower tails at age 17. These findings parallel those
obtained by comparing the 1984 and 1986 assessments. (Graphs of the 1984 and
1986 proficiency distributions are included in Chapter 6.)

There is some evidence of differential effects across subgroups. For
example, there is a tendency for lower-scoring groups at age 17 to be more
disadvantaged by the bridge to 1986 conditions than higher-scoring groups.
This result parallels to some degree the findings in the actual 1984 and 1986
assessments.

An additional finding at age 9, which probably accounts for the slight
inconsistency between the percents correct and the scale value results, is
evidence of speededness of both the 1984 and 1986 instruments. On 18 of 26
items, the percentage of students who did not reach the item exceeded 10 in at
least one of the bridge samples. These results parallel those obtained in the
actual 1984 and 1986 assessments. Because the items were ordered differently
in the two bridges (reflecting the corresponding prior assessments), the
speededness tends to affect different items in each bridge. The median of the
absolute differences between the bridges in not-reached percentages was 11.5;
the corresponding value for the 1984 and 1986 assessments was 11.0.






Not-reached items are treated in different ways in the percents-correct
analysis and the scaling analysis. In NAEP, the percent correct on an item is
based only on students who reached the item. The implicit assumption is that
students who do not reach the items have the same probability of answering
correctly as students who do reach the items. In fact, the relation between
percent not reached and percent correct is not a simple one: On about one-
third of the age 9 items, the bridge sample with the greater not-reached
percent had a higher percent correct on the item. In scaling, the treatment
of not-reached items is consistent with the assumption that students who do
not reach an item have the same probability of answering correctly as other
students who did reach the item and who gave the same responses to the
preceding items and had the same values on the conditioning variables.
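
The percents-correct convention described above can be illustrated with a short sketch. It is a simplified, unweighted version (the tabled NAEP values are weighted), and the function name and response data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def item_summary(responses: Sequence[Optional[int]]) -> Tuple[float, float]:
    """Return (percent correct among students who reached the item,
    percent of all students who did not reach it).

    Each response is 1 (correct), 0 (incorrect), or None (not reached).
    """
    reached = [r for r in responses if r is not None]
    pct_not_reached = 100.0 * (len(responses) - len(reached)) / len(responses)
    # Not-reached students are dropped from the denominator, not scored as wrong.
    pct_correct = 100.0 * sum(reached) / len(reached)
    return pct_correct, pct_not_reached


# Hypothetical responses of ten students to one item.
demo = [1, 1, 0, 1, None, None, 0, 1, None, 1]
pct_correct, pct_not_reached = item_summary(demo)
print(round(pct_correct, 1), round(pct_not_reached, 1))
```

Because the denominator shrinks as more students fail to reach an item, two samples with different not-reached rates are effectively being compared on different subsets of students, which is one reason the percents correct and the scale results need not agree.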

In summary, there is evidence of differences between the 1984 and 1986
forms and administration procedures, but the effects are not the same at all
three ages. The lack of consistency across ages is not surprising in light of
the fact that changes in forms between 1984 and 1986, including, for example,
the degree to which item positions were shifted, were not uniform across the
age groups. The effects of these kinds of changes are discussed further in
Chapters 6 and 8.

2. Comparison of the 1984 Assessment with the 1988 Bridge to 1984

The findings of the comparison of the reading results from the 1984
assessment to those from the corresponding 1988 bridge show slight evidence of
an increase at age 9. There is little or no change, in terms of either mean
percents correct or scale means, at ages 13 and 17. The standard deviations of
the scale values are also quite similar.





3. Comparison of the 1986 Assessment with the 1988 Bridge to 1986 

Whereas the comparison of reading scale means for 1984 to those in the 
1988 bridge to 1984 suggests that reading achievement has remained quite 
stable during the last four years, comparison of the 1986 assessment to its 
corresponding 1988 bridge reveals sizeable increases in scale means. The 
difference is largest at age 9, where the bridge mean exceeds the 1986 mean by 
5.1 scale points. (The corresponding change of three points in mean percent 
correct is also very large.) In considering these differences, it is 
important to note that, for reading assessments that occurred between 1971 and 
1984, the largest change between successive assessments was 4.6 scale points. 
This change took place in a five-year interval (1975 to 1980) at age 9 (see 
Beaton, 1988a, p. 7). This makes a five-point change in two years appear 
unlikely. 

There are two ways in which the bridge to 1986 is known to differ from
the 1986 assessment. First, because of the relatively small size of the
bridge samples, it was not possible to re-create the large assessment sessions
that occurred in some instances in the 1986 assessment of 17-year-olds.
However, this explanation seems unlikely to account for the relatively steep
rise between 1986 and 1988, particularly since a small investigation of
effects of session size in the 1986 assessment showed a slight tendency for
medium sessions (26-50 students) to have lower results than small (1-25
students) or large (more than 50 students) sessions.

The other known difference between 1986 and 1988 is that, as shown in
Table 5.1, the times of testing for the bridge to 1986 were slightly different
from the time of testing for the 1986 assessment. This occurred because it
was desirable to match the times of testing for the two sets of 1988 bridge
samples. The 1988 main NAEP assessment affords an opportunity to assess the
effects of time of testing, since two random half-samples were assessed at
each age, one in the winter and one in the spring. On the average, the spring
samples were tested about two months later than the winter samples. At age 9,
which displayed the largest difference between the half-samples, the mean
reading proficiency increased 2 scale points between winter and spring, with a
standard error of 2.1. The difference at age 13 was 0, with a standard error
of 2.1, and at age 17, there was a drop of 0.8 points between winter and
spring, with a standard error of 1.8. Clearly, within-year changes were not
large. Interpolations based on these results suggest that the changes in time
of testing between 1986 and 1988 would have had little impact.
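
One informal way to read the winter-to-spring figures above is to divide each change by its standard error. The sketch below does so with the reported values; comparing the ratio with 2 simply mirrors the two-standard-error bands used in the figures and is an illustrative convention, not a test the report describes.

```python
# Winter-to-spring change in mean reading proficiency and its standard error,
# by age, as reported for the 1988 main assessment half-samples.
seasonal_changes = {9: (2.0, 2.1), 13: (0.0, 2.1), 17: (-0.8, 1.8)}

# Ratio of each change to its standard error.
ratios = {age: change / se for age, (change, se) in seasonal_changes.items()}

for age, ratio in ratios.items():
    within_two_se = abs(ratio) < 2.0
    print(f"age {age}: change/SE = {ratio:.2f}, within two standard errors: {within_two_se}")
```

At every age the change is smaller than one standard error, which is consistent with the statement that within-year changes were not large.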

An additional hypothesis that was considered was that teaching to the 
test may have occurred in schools that were included in both the 1986 and the 
1988 bridge to 1986 assessments. However, an investigation showed that only 
two schools were included in both assessments. It is therefore implausible 
that teaching to the test could have contributed to the large gains between 
1986 and 1988. 

The large differences between the 1986 assessment and the 1988 bridge to 
1986, which are paralleled in the mathematics and science results (see Chapter 
7) suggest that there were aspects of the 1986 assessment that were not 
duplicated in the 1988 bridge to 1986. 



Summary

The comparison of the 1988 bridge samples yielded the following 
conclusions: 



• The 1986 instruments and conditions appear to have been
advantageous to 13-year-olds and disadvantageous to 17-year-olds,
relative to the 1984 assessment.

• At age 9, the percents of students who failed to reach certain
items were substantially different in the two assessments.

• Based on the 1988 bridge to 1984, there appears to have been
little change in reading proficiency between 1984 and 1988.



Despite the somewhat puzzling gain between 1986 and the 1988 bridge to 1986 at 
age 9, the findings of these bridge comparisons suggested that it would be 
useful to pursue the idea of using the bridge data to equate the 1984 and 1986 
results. The ensuing analyses are described in Chapter 6. 






Chapter 6 

ADJUSTMENT OF THE 1986 READING RESULTS 
TO ALLOW FOR CHANGES IN ITEM ORDER AND CONTEXT 



Rebecca Zwick^ 

Overview

Although the potential effect of item context on proficiency estimation
has been discussed in the measurement literature (see Leary & Dorans, 1985,
for a review and Wise, Chia, & Park, 1989, for a recent example), the
prevailing view has been that item-parameter estimates derived through item
response theory (IRT) methods are relatively robust to changes in item
context. Current testing practices, such as item banking and adaptive
testing, as well as IRT common-item equating methods, such as the one applied
to NAEP reading data in 1986, rest on the assumption of invariance of item
parameters across different test forms.

The analyses described in Chapters 5, 6, and 8 of this report show that 
in the case of the 1984 and 1986 NAEP assessments, the effects of changes in 
item context, position, and administration conditions were large enough to 
produce significant differences in item functioning, which, in turn, led to a 
violation of the item-parameter invariance assumption. One manifestation of 
the impact of changes in item position on item functioning was the large 
difference between the two assessments in the percents of students reaching 

^Analysis plans for the adjustment method described in this chapter were
developed in collaboration with Albert Beaton and Robert Mislevy. David Freund
provided statistical programming, with assistance from Minhwei Wang and Kate
Pashley. David Freund and Jo-Ling Liang produced the figures in this chapter.




certain items. Neither IRT nor any other method in existence can yield
invariant parameters under these circumstances. In theory, more complex
models could be developed that would take into account explicitly any changes
in item position or context, but such models are unlikely to be available in
the near future. Our findings have led us to appreciate the need to avoid
changes in instruments and procedures when assessing trend.

The value of the 1988 bridge data is that it made possible the equating
of the 1984 and 1986 instruments without reliance on common-item assumptions.
By using common-population equating, rather than common-item equating, we
could allow for the possibility that items common to 1984 and 1986 were, in
fact, functioning as different items in each assessment and that form and
administration changes may have had a different impact at each age level.

Common-population equating is possible when two random samples from the
same population are available. The equating is achieved by matching certain
properties of one sample proficiency distribution (in this case, the first two
moments) to those of the other sample distribution, as described in detail
below. The transformation of the proficiency scale that achieves this match
implies a set of transformations for the item parameters. In contrast, our
attempt in 1986 to link the 1986 results to the 1984 reading scale through
common-item equating was based on the assumption of item-parameter invariance.
The analyses of the bridge data reported in this chapter not only permitted us
to investigate the impact of relaxing this invariance assumption but yielded
new item parameters for the 1986 instrument which were used to adjust the 1986
results. To explain how this was done, it is necessary to first describe the
estimation of reading item parameters in 1984 and 1986.







Estimation of Reading Item Parameters in 1984 and 1986

In 1984, item-parameter estimates were obtained simultaneously for
reading items administered to all three age cohorts. The initial estimates of
item parameters and reading proficiency, which were on an arbitrary scale,
were linearly transformed to produce a reading proficiency scale with a mean
of 250.5 and a standard deviation of 50 across all three cohorts (see Mislevy
& Sheehan, 1987).
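
The linear rescaling described above can be sketched as follows. This is a minimal, unweighted illustration: the provisional proficiency values are hypothetical, and the function name is not taken from any NAEP software.

```python
import statistics
from typing import List, Sequence, Tuple


def rescale(values: Sequence[float],
            target_mean: float = 250.5,
            target_sd: float = 50.0) -> Tuple[List[float], float, float]:
    """Linearly transform values so they have the target mean and SD.

    Returns the transformed values together with the slope and intercept.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD of the provisional values
    slope = target_sd / sd
    intercept = target_mean - slope * mean
    return [slope * v + intercept for v in values], slope, intercept


# Hypothetical proficiency estimates on an arbitrary provisional scale.
provisional = [-1.2, -0.3, 0.0, 0.4, 1.1]
scaled, slope, intercept = rescale(provisional)
print(round(statistics.fmean(scaled), 1), round(statistics.pstdev(scaled), 1))
```

The same slope and intercept would also have to be applied to the item difficulty and discrimination parameters so that item response probabilities are unchanged on the new scale.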

In 1986, reading items were administered to balanced incomplete block 
(BIB) spiral samples for all three cohorts and to bridge samples to 1984 at 
ages 9 and 13. Students from all five of these samples (along with students 
in a special language minority study [Baratz-Snowden, Rock, Pollack, & Wilder, 
1988]) were included in a single item calibration (see Zwick, 1988b). The 
steps for obtaining the 1986 item parameters were as follows: 




1. Assume the three-parameter logistic model,

   $$P(x_i = 1 \mid \theta) = c_i + (1 - c_i)\,\{1 + \exp[-1.7\,a_i(\theta - b_i)]\}^{-1},$$

   where $x_i$ is the response to item i, which is scored "1" if
   correct, $\theta$ represents ability, $a_i$ is a discrimination parameter
   for item i, $b_i$ is a difficulty parameter, and the lower asymptote,
   $c_i$, represents the probability that a very low-ability examinee
   answers the item correctly. Use the BILOG program (Mislevy &
   Bock, 1982) to obtain provisional parameter estimates (designated
   86-P) for all items for all three cohorts, ignoring the
   distinction between old and new items.






2. Apply the Stocking-Lord procedure (see Sheehan, 1988) to the items
   common to 1984 and 1986 in order to find the best-fitting linear
   transformation for mapping the 86-P parameters to the 1984
   parameters.

3. Use these transformed parameters (designated L(86-P) to indicate a
   linear transformation of the 86-P parameters) for the items that
   were new to 1986.

4. Substitute the original 1984 parameters for the items that dated
   back to 1984^.
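
The three-parameter logistic model in Step 1 can be written as a short function. This is a sketch for illustration only; the function name and the parameter values in the usage lines are hypothetical and are not NAEP item parameters.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic probability of a correct response.

    theta: examinee ability
    a: item discrimination, b: item difficulty, c: lower asymptote
    """
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


# Hypothetical item with a = 1.0, b = 0.0, c = 0.2.
print(p_correct(0.0, 1.0, 0.0, 0.2))    # ability equal to difficulty
print(p_correct(-10.0, 1.0, 0.0, 0.2))  # very low ability, close to c
```

When ability equals the item difficulty, the probability is halfway between the lower asymptote and 1; for very low ability it approaches the lower asymptote.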



The 1986 bridges to 1984 at ages 9 and 13 included only items that dated
back to 1984. Therefore, the item-parameter estimates used to obtain results
for these two cohorts were a subset of the estimates obtained in 1984.
However, at age 17, trend results were to be based on the age 17 subsample of
the BIB-spiraled assessment, which received 43 items from 1984 and 19 new
items. Therefore, the age 17 results were based on 1984 parameters for the 43
old items and L(86-P) parameters for the 19 new items. The reading results at
age 17 may have appeared particularly anomalous because the 19 new items were
linked into the scale using a common-item equating procedure that, we have
since learned, was inappropriate for linking the 1984 and 1986 measures,
despite its frequent use in similar applications outside NAEP.

^An alternative would have been to use the L(86-P) parameters for these
items as well, or to use some weighted combination of the original 1984 and
L(86-P) parameters. In fact, we did estimate the reading mean for age 17 using
the L(86-P) parameters for all items and found the mean to be even lower than
the reported anomalous mean.




1988 Adjustment Procedure 

To derive an adjustment of the 1986 reading results, we took advantage
of the fact that, for each cohort, the 1988 bridges provided reading data for
the 1984 and 1986 instruments and procedures based on random samples from the
identically defined populations of 9-, 13-, and 17-year-olds. We were,
therefore, able to abandon common-item equating procedures and make use of
common-population equating. Then, under the assumption that the relation
between the 1984 and 1986 measurement systems was stable over time, it was
possible to use the equating functions to adjust the 1986 reading results.
The steps in the adjustment procedure were as follows:



1. Use the RESOLVE program (which implements the procedures described
in Mislevy, 1984) to obtain a nonparametric estimate of the
proficiency distribution for each cohort of the 1988 bridge to
1986, using the 86-P parameters. (The results in Table 5.9 for
the 1988 bridge to 1986 rely on common-item assumptions and could
not, therefore, serve as the basis of the equating.)

2. For each cohort, obtain the linear transformation that matches the 
mean and standard deviation for the 1986 bridge, obtained from 
Step 1, to those of the 1984 bridge. (Note that the linear 
transformations were not constrained to be the same from one age 
to another.) 






3. Use these transformations to adjust the 86-P item parameters,
yielding three new sets of item parameters, L'9(86), L'13(86), and
L'17(86) (different linear transformations at each age). These
adjusted parameters appear in Tables D.2 through D.4 in Appendix
D.

4. Apply these L'9(86), L'13(86), and L'17(86) parameters to the
reading data collected in 1986 to reestimate the 1986 reading
scale values, using plausible values methodology (Mislevy, 1988),
a method of estimating proficiency distributions based on
students' item responses and background characteristics (referred
to in this context as conditioning variables). The same
conditioning variables and coding scheme were used as in the
original 1986 analysis (Mislevy, 1988, p. 198). The estimated
coefficients for the conditioning variables appear in Tables D.5
through D.7 in Appendix D.
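
Steps 2 and 3 can be sketched in a few lines. The sketch assumes the standard result that, under a linear change of proficiency scale theta* = A theta + B, the three-parameter logistic model is unchanged when a* = a / A, b* = A b + B, and c* = c. The provisional-scale mean and standard deviation and the item parameters below are hypothetical; the target mean and standard deviation are the age 17 bridge-to-1984 values from Table 5.9, used here only for illustration. All names are illustrative, not taken from the RESOLVE or BILOG programs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ItemParams:
    a: float  # discrimination
    b: float  # difficulty
    c: float  # lower asymptote


def matching_transformation(src_mean: float, src_sd: float,
                            tgt_mean: float, tgt_sd: float) -> Tuple[float, float]:
    """Slope and intercept of the linear map that gives the source
    distribution the target mean and standard deviation."""
    slope = tgt_sd / src_sd
    intercept = tgt_mean - slope * src_mean
    return slope, intercept


def transform_item(item: ItemParams, slope: float, intercept: float) -> ItemParams:
    """Re-express item parameters on the transformed proficiency scale."""
    return ItemParams(a=item.a / slope, b=slope * item.b + intercept, c=item.c)


# Hypothetical provisional-scale moments for one cohort of the bridge to 1986.
slope, intercept = matching_transformation(0.2, 1.1, 289.9, 37.4)
adjusted = transform_item(ItemParams(a=1.0, b=0.5, c=0.2), slope, intercept)
print(round(slope, 2), round(intercept, 2), adjusted)
```

A separate slope and intercept would be computed for each age, since the report notes that the transformations were not constrained to be the same across cohorts.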

Results of the Adjustment 

The reading scale means from the adjustment procedure are shown in 
Figure 6.1^, along with earlier reading results and 1988 results based on the 
bridge to 1984. The means and standard deviations for 1984, 1986, and the 
adjusted 1986 results are also given in Table 6.1 for each cohort.^ (The 

^This figure is identical to Figure 1.2. It is repeated here for
convenience.

^For reasons described in Chapters 4 and 5, the 1984 and 1986 results for 
ages 9 and 13 differ from previously published results. The 1984 results in 
Table 6.1 incorporate the adjustment in sample weights; the original 1986 results 
incorporate the modification in the conditioning model. Appendix A (p. 171) 
gives a summary of which modifications and adjustments of reading scale results 
are used in the tables and figures in this report. 




Figure 6.1 

Modified Results 

*Bands extend from two estimated standard errors below to two estimated standard 
errors above the mean. Appendix A (p. 171) gives a summary of which modifications and 
adjustments of reading scale results are used in the tables and figures in this report. 

Weighted Reading Proficiency Means and Standard Errors 

Year              Age 9 (S.E.)    Age 13 (S.E.)   Age 17 (S.E.) 
1971              207.3 (1.0)     255.2 (0.9)     285.4 (1.2) 
1975              210.2 (0.7)     256.0 (0.8)     286.1 (0.8) 
1980              214.8 (1.1)     258.5 (0.9)     285.8 (1.4) 
1984              211.0 (1.0)     257.1 (0.7)     288.8 (0.9) 
1986 Adjusted     208.6 (1.9)     255.0 (1.6)     286.0 (1.7) 
1988              211.8 (1.2)     257.5 (0.9)     290.1 (1.1) 






Table 6.1 

Results of Adjustment of the 1986 Reading Scale [a] 

              1984 [b]                Unadjusted 1986 [c]         Adjusted 1986 
          Mean    S.E.    S.D.     Mean    S.E.        S.D.     Mean    S.E.    S.D. 
Age 9     211.0   (1.0)   41.1     208.9   (1.2)       39.6     208.6   (1.9)   38.6 
Age 13    257.1   (0.7)   35.5     259.4   (1.0)       35.7     255.0   (1.6)   34.7 
Age 17    288.8   (0.9)   40.3     277.4   (1.1) [d]   49.4     286.0   (1.7)   39.5 

[a] Standard errors are given in parentheses. See Appendix E for 
computation of standard errors for adjusted means. Appendix A (p. 171) gives 
a summary of which modifications and adjustments of reading scale results are 
used in the tables and figures in this report. 

[b] Results incorporate the adjustment to the sampling weights. 

[c] Results incorporate the modification in the conditioning model. 

[d] Standard error differs from column 4 of Table 4.1 because of a 
change in jackknife methodology. 






standard errors for the adjusted means take into account the error associated 
with estimating the equating functions. Details of the standard error 
computations are given in Appendix E.) Overlays of the 1984 and adjusted 1986 
distributions for each cohort appear in Figures 6.2 through 6.4. These can be 
contrasted with the overlays of the 1984 and original 1986 distributions for 
each cohort that appear in Figures 6.5 through 6.7.^ 

The result of the adjustment is most dramatic at age 17, where the mean 
increased by 8.6 points and the standard deviation was reduced by 9.9 points. 
At age 13, the mean decreased by 4.4 points. The adjustment reduced the heavy 
upper and lower tails at age 17, as well as the heavy upper tail at age 13. 
The mean for age 9 stayed virtually the same, as did the standard deviations 
at ages 9 and 13. The changes from 1984 to the adjusted 1986 results are 
quite consistent across ages. The decreases in the means are 2.4, 2.1, and 
2.8 for ages 9, 13, and 17, respectively. 
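
The differences quoted in this paragraph follow directly from the means in Table 6.1. The sketch below simply redoes that arithmetic; the dictionary layout and the function name are illustrative choices, and only the means themselves come from the report.

```python
# Means by age, transcribed from Table 6.1.
MEANS = {
    9: {"1984": 211.0, "unadjusted_1986": 208.9, "adjusted_1986": 208.6},
    13: {"1984": 257.1, "unadjusted_1986": 259.4, "adjusted_1986": 255.0},
    17: {"1984": 288.8, "unadjusted_1986": 277.4, "adjusted_1986": 286.0},
}


def summarize(means):
    """Return {age: (effect of the adjustment, change from 1984)} in scale points."""
    summary = {}
    for age, m in means.items():
        effect = round(m["adjusted_1986"] - m["unadjusted_1986"], 1)
        change = round(m["adjusted_1986"] - m["1984"], 1)
        summary[age] = (effect, change)
    return summary


for age, (effect, change) in summarize(MEANS).items():
    print(f"Age {age}: adjustment {effect:+.1f}, 1984 to adjusted 1986 {change:+.1f}")
```

The first number in each pair is the shift produced by the adjustment; the second is the remaining difference from 1984.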

Figures 6.8 through 6.10 give another perspective on the results of the 
adjustment for each of the three cohorts. To construct these graphs, 2,000 
students were selected within each cohort. A proficiency estimate for each 
student was computed^ based on both the original and the adjusted item 
parameters. The original values were plotted along the x-axis; the adjusted 
values, along the y-axis. The changes in the tails of the distribution at age 
17 are very evident in Figure 6.10. 

^Figures 6.2 to 6.7 are based on all five plausible values. 

^The proficiency estimates used in this graph are the mean plausible values. 
Note that the same graph could have been obtained by estimating each point 
separately for each of the five sets of plausible values and obtaining a final 
estimate by averaging the five sets of x-coordinates and the five sets of 
y-coordinates of these points. The method used here is, therefore, consistent 
with NAEP's analysis recommendations (see Mislevy, 1988). 




Figure 6.2 



Estimated Reading Proficiency Distributions for 1984 and 1986 (Adjusted) 

Age 9 







Figure 6.3 



Estimated Reading Proficiency Distributions for 1984 and 1986 (Adjusted) 

Age 13 



SOLID: 1984 
DASHED: ADJUSTED 1986 

Vertical axis: Proportion of Population per 10-Point Interval 
Horizontal axis: Reading Proficiency Scale 






Figure 6.4 



Estimated Reading Proficiency Distributions for 1984 and 1986 (Adjusted) 

Age 17 




Reading Proficiency Scale 






Figure 6.5 



Estimated Reading Proficiency Distributions for 1984 and 1986 (Original) 

Age 9 







SOLID: 1984 
DASHED: ORIGINAL 1986 

Vertical axis: Proportion of Population per 10-Point Interval 
Horizontal axis: Reading Proficiency Scale 






Figure 6.6 

Estimated Reading Proficiency Distributions for 1984 and 1986 (Original) 

Age 13 



SOLID: 1984 
DASHED: ORIGINAL 1986 

Vertical axis: Proportion of Population per 10-Point Interval 
Horizontal axis: Reading Proficiency Scale 






Figure 6.7 



Estimated Reading Proficiency Distributions for 1984 and 1986 (Original) 

Age 17 







Figure 6.8 



Relation Between Reading Scale Values Constructed with 
Original and Adjusted Item Parameters, Age 9 




SCALE VALUES - ORIGINAL PARAMETERS 






Figure 6.9 



Relation Between Reading Scale Values Constructed with 
Original and Adjusted Item Parameters, Age 13 







Figure 6.10 

Relation Between Reading Scale Values Constructed with 
Original and Adjusted Item Parameters, Age 17 







Adjusted means for subgroups are given in Tables 6.2 through 6.4. At 
age 17, the adjustment led to some substantial changes in the patterns of 
subgroup differences. For example, as shown below, the adjustment increased 
the mean for Black students by 13.2 points and the mean for Hispanic students 
by 12.3, resulting in values close to those observed in 1984. Because the 
adjustment produced an increase of only 7.5 points for White students, the 
differences between White students and minority students in the adjusted 
results are smaller than in the unadjusted results; in fact, they are smaller 
than those observed in 1984. Also, the adjustment led to an increase of about 
15 points for the students below modal grade, whereas the mean for students at 
modal grade increased seven points and the mean for students above modal grade 
increased five points. The adjusted 1986 results resemble closely the 1984 
results. 

                          1984              1986           Adjusted 1986 
                      Mean    S.E.      Mean    S.E.      Mean    S.E. 
White                 295.6   (0.7)     283.9   (1.2)     291.4   (1.8) 
Black                 264.2   (1.2)     251.8   (1.5)     265.0   (2.0) 
Hispanic              268.1   (1.9)     255.3   (2.6)     267.6   (2.6) 
Below Modal Grade     258.8   (1.2)     242.8   (1.3)     257.7   (1.9) 
At Modal Grade        295.7   (0.7)     286.0   (1.0)     293.1   (1.7) 
Above Modal Grade     303.8   (1.3)     296.0   (2.7)     301.0   (2.6) 
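
The statement that the adjustment narrowed the differences between White and minority students at age 17 can be checked from the table above. The following sketch recomputes those gaps; the names are illustrative, and the code works only with the point estimates, so it says nothing about the sampling error of the gaps.

```python
# Age 17 subgroup means, transcribed from the table above.
MEANS = {
    "White": {"1984": 295.6, "1986": 283.9, "adjusted_1986": 291.4},
    "Black": {"1984": 264.2, "1986": 251.8, "adjusted_1986": 265.0},
    "Hispanic": {"1984": 268.1, "1986": 255.3, "adjusted_1986": 267.6},
}


def gap(means, group, reference="White"):
    """Reference-group mean minus group mean, for each column of results."""
    return {
        column: round(means[reference][column] - means[group][column], 1)
        for column in means[reference]
    }


for group in ("Black", "Hispanic"):
    print(group, gap(MEANS, group))
```

In both comparisons the adjusted 1986 gap is smaller than the unadjusted 1986 gap and smaller than the 1984 gap, as the text describes.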



Summary 

Our analysis of the 1988 bridge data helped us to understand the reading 
anomaly and also provided a means of adjusting the 1986 results. The 
existence of the bridge data allowed us to equate the 1984 and 1986 
instruments through common-population equating, rather than common-item 
equating. Using the transformed item parameters that resulted from the 






Table 6.2 

Reading Trend Results Including 1986 Adjusted Values: Age 9 
(standard errors in parentheses) 

                      1971           1975           1980           1984           Unadjusted     Adjusted       1988 [b] 
                                                                                  1986           1986 [a] 

TOTAL                 207.3 (1.0)    210.2 (0.7)    214.8 (1.1)    211.0 (1.0)    208.9 (1.2)    208.6 (1.9)    211.8 (1.2) 

SEX 
  MALE                200.9 (1.1)    204.4 (0.8)    209.7 (1.3)    207.7 (1.1)    204.7 (1.3)    204.5 (1.9)    207.5 (1.5) 
  FEMALE              213.7 (1.1)    215.9 (0.8)    220.0 (1.1)    214.2 (1.0)    213.1 (1.4)    212.8 (2.0)    216.3 (1.4) 

OBSERVED ETHNICITY/RACE 
  WHITE [c]           213.8 (1.0)    216.6 (0.7)    221.3 (0.9)    218.3 (0.8)    215.3 (1.3)    214.9 (1.9)    217.7 (1.5) 
  BLACK               170.0 (1.6)    181.3 (1.1)    189.2 (1.6)    185.7 (1.2)    184.9 (1.6)    185.0 (2.3)    188.5 (2.6) 
  HISPANIC            ***** (0.0)    182.8 (2.3)    189.5 (3.3)    137.2 (1.6)!   189.4 (3.4)    189.8 (3.7)    193.7 (3.9) 
  OTHER               193.4 (4.7)!   207.9 (5.1)!   218.5 (4.1)!   222.6 (2.7)    204.0 (6.8)!   203.7 (6.9)!   228.4 (5.0) 

REGION 
  NORTHEAST           213.0 (1.7)    214.8 (1.4)    220.9 (2.5)    215.9 (2.0)    212.8 (2.9)    212.3 (3.1)    215.2 (2.8) 
  SOUTHEAST           194.3 (2.8)    201.2 (1.1)    210.2 (2.3)    204.3 (2.7)    202.4 (2.8)!   202.5 (3.1)!   207.2 (2.3) 
  CENTRAL             214.4 (1.4)    215.5 (1.1)    216.5 (1.2)    215.6 (1.6)    213.3 (2.7)    212.9 (3.1)    218.2 (2.5) 
  WEST                204.6 (1.8)    207.1 (2.0)    212.4 (2.2)    209.1 (2.0)    206.6 (3.0)    206.5 (3.3)    207.9 (2.8) 

PARENTAL EDUCATION 
  NOT GRAD. H.S.      188.4 (1.3)    190.0 (1.2)    193.9 (1.6)    195.1 (1.5)    189.5 (2.5)    189.5 (3.2)    192.5 (5.3) 
  GRADUATED H.S.      207.7 (1.1)    211.3 (0.9)    212.7 (1.3)    208.9 (1.2)    201.9 (2.0)    202.2 (2.4)    210.8 (2.0) 
  POST H.S.           223.7 (1.3)    221.5 (0.9)    225.9 (1.2)    222.9 (1.1)    219.7 (1.3)    219.0 (2.0)    220.0 (1.6) 
  DO NOT KNOW [d]     197.4 (1.0)    203.2 (0.8)    206.0 (1.0)    204.4 (1.0)    200.8 (1.6)    200.9 (2.2)    204.4 (1.9) 
  MISSING             150.0 (19.2)!  ***** (0.0)    ***** (0.0)    160.8 (4.7)!   189.3 (10.9)!  189.2 (11.0)!  162.3 (11.3) 

GRADE 
  < MODAL GRADE       177.5 (1.2)    183.5 (1.2)    188.8 (1.3)    186.9 (1.2)    188.8 (1.4)    189.4 (2.1)    192.6 (1.8) 
  AT MODAL GRADE      216.8 (1.1)    218.3 (0.7)    225.0 (0.9)    222.8 (0.9)    219.1 (1.2)    213.5 (1.9)    222.8 (1.5) 
  > MODAL GRADE       231.8 (3.7)    225.8 (3.6)    243.3 (6.3)!   253.6 (5.4)!   244.4 (11.9)!  241.9 (11.5)!  262.4 (10.0) 
  MISSING             174.5 (10.4)!  199.8 (8.6)!   190.1 (2.2)    ***** (0.0)    ***** (0.0)    ***** (0.0)    ***** (0.0) 

ITEMS IN THE HOME 
  0-2 ITEMS           186.2 (1.0)    193.9 (0.9)    197.7 (1.4)    196.4 (0.9)    194.3 (1.5)    194.6 (2.1)    198.5 (2.1) 
  3 ITEMS             207.9 (1.0)    212.2 (0.7)    216.6 (1.0)    216.6 (0.9)    210.9 (1.4)    210.6 (2.0)    214.8 (1.5) 
  4 ITEMS             222.8 (0.9)    225.0 (0.8)    227.9 (1.0)    227.1 (1.0)    221.2 (1.2)    220.6 (1.9)    223.0 (1.7) 
  DO NOT KNOW         161.7 (6.5)    177.8 (14.8)!  167.0 (10.2)!  155.7 (6.7)!   169.0 (17.6)!  171.2 (18.0)!  181.4 (25.0) 
  MISSING             180.1 (13.1)!  183.6 (5.0)!   204.9 (3.2)!   158.5 (3.5)!   243.1 (25.2)!  235.3 (24.4)!  160.2 (13.7) 

TELEVISION WATCHED PER DAY [e] 
  0-2 HOURS           ***** (0.0)    ***** (0.0)    219.9 (1.1)    219.3 (1.3)    212.3 (1.9)    212.1 (2.4)    217.0 (1.7) 
  3-5 HOURS           ***** (0.0)    ***** (0.0)    222.3 (0.7)    218.3 (0.9)    217.0 (1.2)    216.4 (1.9)    218.2 (1.6) 
  6 HOURS OR MORE     ***** (0.0)    ***** (0.0)    211.0 (0.8)    198.9 (1.0)    195.0 (1.6)    195.2 (2.1)    198.1 (1.6) 
  MISSING             ***** (0.0)    ***** (0.0)    153.8 (2.4)    192.3 (2.1)!   211.5 (14.1)!  208.1 (14.3)!  194.0 (7.8) 

[a] Using adjustment data and adjusted standard errors. Appendix A (p. 171) gives a summary of which modifications 
and adjustments of reading scale results are used in the tables and figures in this report. 

[b] Based on the 1988 bridge to 1984 

[c] Includes Hispanic students in 1970-71 

[d] Includes "MISSING" in 1970-71, 1974-75, and 1979-80 

[e] Unavailable in 1970-71 and 1974-75 

! Interpret with caution: the sampling error cannot be accurately estimated, since the coefficient of variation 
of the estimated total number of students in the subpopulation exceeds 20 percent. 






Table 6.3 

Reading Trend Results Including 1986 Adjusted Values: Age 13 
(standard errors in parentheses) 

                      1971           1975           1980           1984           Unadjusted     Adjusted       1988 [b] 
                                                                                  1986           1986 [a] 

TOTAL                 255.2 (0.9)    256.0 (0.8)    258.5 (0.9)    257.1 (0.7)    259.4 (1.0)    255.0 (1.6)    257.5 (0.9) 

SEX 
  MALE                249.5 (1.0)    249.6 (0.8)    254.3 (1.1)    252.7 (0.8)    255.1 (1.0)    250.9 (1.6)    251.8 (1.2) 
  FEMALE              260.9 (0.9)    262.4 (0.9)    262.7 (0.9)    251.7 (0.8)    263.6 (1.4)    259.1 (1.9)    263.0 (1.0) 

OBSERVED ETHNICITY/RACE 
  WHITE [c]           260.9 (0.8)    262.1 (0.7)    264.4 (0.6)    262.6 (0.5)    263.4 (1.3)    258.8 (1.8)    261.3 (1.0) 
  BLACK               222.4 (1.1)    225.7 (1.2)    232.4 (1.5)    236.0 (1.2)    242.9 (1.6)    239.3 (2.1)    242.9 (2.3) 
  HISPANIC            ***** (0.0)    232.5 (3.4)    236.8 (2.1)    239.6 (1.6)!   246.3 (3.2)    242.2 (3.4)    240.1 (3.5) 
  OTHER               251.4 (3.5)!   255.4 (4.9)!   252.8 (4.8)!   260.1 (2.9)    268.8 (4.3)    263.9 (4.4)    269.3 (4.3) 

REGION 
  NORTHEAST           261.2 (2.0)    258.8 (1.8)    260.1 (1.8)    260.4 (0.7)    263.6 (2.2)    258.7 (2.5)    258.6 (2.0) 
  SOUTHEAST           245.0 (1.7)    249.3 (1.5)    252.7 (1.7)    256.4 (1.8)    259.0 (1.6)!   254.8 (2.0)!   257.6 (1.9) 
  CENTRAL             260.0 (1.9)    261.6 (1.4)    264.6 (1.5)    258.7 (1.2)    254.5 (3.6)    250.8 (3.9)    255.9 (2.0) 
  WEST                253.5 (1.2)    253.1 (1.6)    256.3 (2.1)    253.9 (1.4)    260.8 (1.8)    256.0 (2.2)    257.9 (2.1) 

PARENTAL EDUCATION 
  NOT GRAD. H.S.      238.5 (1.1)    238.6 (1.2)    238.5 (1.3)    240.1 (1.2)    248.9 (2.9)    244.2 (3.2)    246.5 (2.2) 
  GRADUATED H.S.      255.5 (0.8)    254.6 (0.7)    253.6 (0.8)    253.2 (0.8)    253.5 (1.1)    249.3 (1.7)    252.7 (1.2) 
  POST H.S.           270.2 (0.8)    269.9 (O.C)    270.9 (0.8)    267.7 (0.7)    267.3 (0.9)    262.7 (1.5)    265.3 (1.4) 
  DO NOT KNOW [d]     233.1 (1.1)    234.9 (1.0)    233.3 (1.7)    236.5 (1.4)    239.6 (3.2)    236.4 (3.4)    240.4 (2.8) 
  MISSING             ***** (0.0)    ***** (0.0)    ***** (0.0)    255.0 (4.4)!   264.5 (15.8)!  259.6 (15.8)!  224.7 (19.2) 

GRADE 
  < MODAL GRADE       229.5 (1.0)    232.3 (1.0)    239.6 (1.5)    239.1 (0.9)    242.2 (1.5)    238.4 (1.9)    242.8 (1.3) 
  AT MODAL GRADE      264.8 (0.8)    264.9 (0.7)    266.1 (0.9)    266.7 (0.6)    267.7 (1.0)    263.0 (1.5)    266.7 (1.1) 
  > MODAL GRADE       278.1 (2.6)    278.1 (4.0)    274.5 (4.9)!   295.3 (8.6)!   280.4 (6.3)!   275.8 (6.3)!   271.8 (11.4) 
  MISSING             225.2 (9.8)!   204.9 (15.8)!  249.7 (10.7)   ***** (0.0)    ***** (0.0)    ***** (0.0)    ***** (0.0) 

ITEMS IN THE HOME 
  0-2 ITEMS           226.6 (1.2)    231.5 (1.2)    235.8 (1.4)    238.4 (1.0)    242.7 (1.6)    238.4 (2.0)    242.9 (1.8) 
  3 ITEMS             248.9 (0.9)    249.7 (0.8)    253.1 (1.1)    254.3 (0.7)    256.6 (1.3)    252.9 (1.7)    255.6 (1.0) 
  4 ITEMS             266.5 (0.7)    267.4 (0.7)    268.5 (0.7)    266.1 (0.7)    266.0 (1.0)    263.1 (1.6)    264.2 (1.3) 
  DO NOT KNOW         218.6 (7.3)!   218.5 (14.5)!  222.9 (6.1)!   204.7 (10.5)!  278.5 (81.3)!  275.0 (81.0)!  212.7 (17.1) 
  MISSING             224.4 (9.9)!   227.3 (14.2)!  247.3 (6.7)!   248.8 (5.5)!   264.1 (18.1)!  259.0 (18.2)!  223.4 (21.6) 

TELEVISION WATCHED PER DAY [e] 
  0-2 HOURS           ***** (0.0)    ***** (0.0)    263.3 (0.9)    268.1 (0.8)    265.1 (1.9)    260.3 (2.2)    264.3 (1.4) 
  3-5 HOURS           ***** (0.0)    ***** (0.0)    257.1 (0.9)    261.6 (0.6)    262.0 (1.0)    257.5 (1.6)    258.7 (1.0) 
  6 HOURS OR MORE     ***** (0.0)    ***** (0.0)    243.2 (1.3)    244.2 (0.9)    245.5 (1.6)    241.9 (2.0)    243.5 (2.0) 
  MISSING             ***** (0.0)    ***** (0.0)    233.2 (4.1)    238.0 (1.3)    227.6 (17.8)!  220.9 (18.6)!  227.9 (3.0) 

[a] Using adjustment data and adjusted standard errors. Appendix A (p. 171) gives a summary of which modifications 
and adjustments of reading scale results are used in the tables and figures in this report. 

[b] Based on the 1988 bridge to 1984 

[c] Includes Hispanic students in 1970-71 

[d] Includes "MISSING" in 1970-71, 1974-75, and 1979-80 

[e] Unavailable in 1970-71 and 1974-75 

! Interpret with caution: the sampling error cannot be accurately estimated, since the coefficient of variation 
of the estimated total number of students in the subpopulation exceeds 20 percent. 






Table 6.4 

Reading Trend Results Including 1986 Adjusted Values: Age 17 
(standard errors in parentheses) 

                      1971           1975           1980           1984           Unadjusted     Adjusted       1988 [c] 
                                                                                  1986           1986 [b] 

TOTAL                 285.4 (1.2)    286.1 (0.8)    285.8 (1.4)    288.8 (0.9)    277.4 (1.1)[a] 286.0 (1.7)    290.1 (1.1) 

SEX 
  MALE                27b.0 (1.2)    280.1 (0.9)    282.1 (1.4)    283.8 (0.9)    269.8 (1.4)    279.6 (1.9)    286.0 (1.5) 
  FEMALE              291.5 (1.3)    291.8 (0.9)    289.5 (1.4)    293.9 (1.1)    285.2 (1.2)    292.6 (1.8)    293.8 (1.6) 

OBSERVED ETHNICITY/RACE 
  WHITE [d]           291.4 (1.0)    293.0 (0.6)    293.1 (1.2)    295.6 (0.7)    283.9 (1.2)    291.4 (1.8)    294.7 (1.3) 
  BLACK               238.6 (1.7)    240.4 (1.9)    242.5 (2.0)    264.2 (1.2)    251.8 (1.5)    265.0 (2.0)    27/j.4 (2.6) 
  HISPANIC            ***** (0.0)    252.2 (3.6)    260.7 (3.3)    268.1 (1.9)!   255.3 (2.6)    267.5 (2.6)    270.8 (4.0) 
  OTHER               276.3 (7.1)!   275.3 (4.3)!   280.7 (4.0)    284.5 (3.1)    266.4 (5.5)    276.0 (4.7)    290.0 (5.7) 

REGION 
  NORTHEAST           292.2 (2.5)    289.5 (1.7)    285.4 (2.4)    292.0 (2.1)    286.0 (2.5)    293.1 (2.5)    294.8 (2.5) 
  SOUTHEAST           270.8 (2.5)    277.3 (1.4)    281.0 (2.6)    284.6 (2.3)    269.4 (1.3)    279.4 (1.8)    285.5 (2.1) 
  CENTRAL             290.8 (2.1)    291.9 (1.5)    288.6 (3.2)    290.1 (1.5)    279.7 (2.0)    288.1 (2.5)    291.2 (1.8) 
  WEST                283.7 (1.7)    282.3 (1.8)    286.6 (1.7)    289.1 (1.6)    273.5 (1.9)    282.7 (2.2)    289.0 (2.2) 

PARENTAL EDUCATION 
  NOT GRAD. H.S.      261.6 (1.5)    263.3 (1.4)    261.9 (1.7)    269.3 (1.4)    253.4 (1.7)    266.3 (2.1)    267.4 (2.4) 
  GRADUATED H.S.      283.3 (1.2)    281.7 (1.0)    277.4 (1.1)    281.1 (1.0)    264.9 (1.0)    275.9 (1.7)    282.0 (1.5) 
  POST H.S.           302.3 (1.0)    300.9 (0.7)    299.3 (1.2)    301.2 (0.8)    289.3 (1.1)    295.8 (1.7)    299.5 (1.3) 
  DO NOT KNOW [e]     261.8 (6.5)    240.2 (2.8)    249.5 (3.9)    256.5 (2.1)    237.9 (2.7)    253.4 (2.7)    254.7 (6.1) 
  MISSING             ***** (0.0)    ***** (0.0)    ***** (0.0)    280.8 (7.9)!   249.3 (7.7)!   261.8 (6.4)!   230.5 (27.8) 

GRADE 
  < MODAL GRADE       238.6 (1.5)    242.8 (1.8)    243.8 (2.3)    258.8 (1.2)    242.8 (1.3)    257.7 (1.9)    265.4 (2.2) 
  AT MODAL GRADE      291.3 (1.0)    292.5 (0.7)    291.5 (1.2)    295.7 (0.7)    286.0 (1.1)    293.1 (1.7)    296.5 (1.1) 
  > MODAL GRADE       302.9 (1.6)    301.8 (1.0)    301.2 (2.2)    303.8 (1.3)    296.0 (2.7)    301.0 (2.6)    304.6 (2.6) 
  MISSING             257.6 (17.6)!  259.5 (13.2)!  241.3 (22.3)!  ***** (0.0)    ***** (0.0)    ***** (0.0)    ***** (0.0) 

ITEMS IN THE HOME 
  0-2 ITEMS           246.2 (1.8)    251.7 (2.1)    257.6 (2.2)    264.1 (1.4)    250.7 (1.6)    264.1 (2.0)    268.8 (2.4) 
  3 ITEMS             273.9 (1.4)    275.8 (1.1)    278.5 (1.8)    283.0 (   )    270.9 (1.4)    280.6 (1.9)    287.1 (1.7) 
  4 ITEMS             295.6 (1.0)    296.1 (0.6)    295.6 (1.1)    296.3 (0.8)    286.5 (1.0)    293.5 (1.7)    295.8 (1.2) 
  DO NOT KNOW         238.3 (18.0)!  205.3 (21.6)!  ***** (0.0)    220.6 (8.3)!   222.9 (14.1)!  242.1 (11.1)!  227.4 (14.0) 
  MISSING             284.6 (57.6)!  239.4 (11.7)   244.2 (5.8)!   221.4 (10.4)!  240.7 (9.9)!   254.8 (8.1)!   199.0 (17.9) 

TELEVISION WATCHED PER DAY [f] 
  0-2 HOURS           ***** (0.0)    ***** (0.0)    291.0 (1.3)    297.4 (0.9)    287.5 (1.3)    294.3 (1.8)    295.6 (1.2) 
  3-5 HOURS           ***** (0.0)    ***** (0.0)    277.1 (1.3)    284.5 (0.9)    273.5 (1.0)    282.8 (1.7)    285.4 (1.8) 
  6 HOURS OR MORE     ***** (0.0)    ***** (0.0)    257.7 (2.9)    267.8 (1.4)    249.1 (1.6)    263.0 (2.0)    268.6 (4.1) 
  MISSING             ***** (0.0)    ***** (0.0)    256.1 (7.2)    245.9 (2.2)    241.2 (9.8)!   255.6 (8.0)!   232.9 (10.3) 

[a] Standard error differs from column 4 of Table 4.1 because of a change in jackknife methodology 

[b] Using adjustment data and adjusted standard errors. Appendix A (p. 171) gives a summary of which modifications 
and adjustments of reading scale results are used in the tables and figures in this report. 

[c] Based on the 1988 bridge to 1984 

[d] Includes Hispanic students in 1970-71 

[e] Includes "MISSING" in 1970-71, 1974-75, and 1979-80 

[f] Unavailable in 1970-71 and 1974-75 

! Interpret with caution: the sampling error cannot be accurately estimated, since the coefficient of variation 
of the estimated total number of students in the subpopulation exceeds 20 percent. 






procedure, adjusted 1986 results were produced. At age 17, the adjustment 
resulted in an increase of 8.6 points in the mean and a decrease of 9.9 points 
in the standard deviation. Some changes occurred in the patterns of subgroup 
differences. At age 13, the mean decreased by 4.4 points and, at age 9, the 
results were essentially unchanged. The adjusted 1986 results are 2 to 3 
points lower than the 1984 results at all three ages. 

Our findings on the impact of item position and context lead us to the 
conclusion that common-item equating procedures should not be assumed to be 
appropriate when form changes have taken place. This reinforces the 
importance of using intact blocks of items for purposes of scale equating in 
NAEP. In all cases in which the estimation of trend is planned, our designs 
for 1990 and 1992 include intact blocks that retain the same position within 
the assessment booklet. This more conservative approach should greatly 
diminish the likelihood of anomalous results such as those that occurred in 
1986. 






Chapter 7 

MATHEMATICS AND SCIENCE TREND DATA ANALYSIS 
Kentaro Yamamoto^ 

Because mathematics and science items were part of the bridge samples 
used in 1988 to illuminate the anomalous 1986 reading results, data for these 
subject areas were also available for trend analyses. This chapter describes 
the technical details of the item-parameter estimation and scaling performed 
for trend analyses of responses to mathematics and science cognitive items in 
the 1988 assessment. 

To maintain the comparability of measurement instruments, booklets for 
the 1988 reading bridge to 1986 were identical to those used in 1986 and 
therefore included science and mathematics blocks. The 1988 mathematics and 
science trend analyses are limited to data from blocks that appeared in the 
same booklets as reading blocks in the 1986 assessment. For age 17, the 
number of mathematics and science blocks available for trend analysis was 
smaller in 1988 than in 1986. However, since every 1986 trend booklet for ages 
9 and 13 contained a block from each of the three subject areas, the complete 
sets of trend blocks for those ages were available for analysis in 1988. 



^Maxine Kingston, Edward Kulick, Michael Narcowich, and Mihhwei Wang 
performed the data analyses for this chapter; Jo-Ling Liang and Edward Kulick 
produced the figures. Robert Mislevy provided consultation on scaling and 
Rebecca Zwick provided valuable editorial assistance. 




The combination of blocks within booklets, the composition of item 
blocks, the mode of administration, the sample definition, and the time of 
testing were identical for the age 9 and age 13 samples in the 1986 assessment 
and the 1988 bridge to 1986. Consequently, trend analyses for these two ages 
were straightforward, but analyses of trends for age 17 were not. 

In 1986, the reading trend for age 17 was assessed as part of the BIB 
spiral portion of the assessment, while the science and mathematics trends 
were assessed apart from reading under a paced-tape mode of administration. 
Since the overarching aim of the 1988 bridge study was to replicate the 
booklets and administration procedures for the 1986 assessment of trends in 
reading, booklets from the BIB spiral portion of the 1986 assessment were 
again administered in 1988 under the same administration conditions as in 
1986. In particular, the administration of mathematics and science items in 
the spiral portion was by paper and pencil, rather than by paced tape. This 
means that the data from the 1988 age 17 trend assessments of mathematics and 
science are comparable to the 1986 BIB assessment and not directly to the 1986 
trend assessment. As a result, the equating design needed to align the 1988 trend 
point for age 17 students with the past trend was more complicated than before. For 
age 17 in 1988, two types of equating were necessary: one based on common populations 
across different modes of administration for the 1986 BIB and trend, and one 
based on common items (similarly placed) for the 1986 BIB and the 1988 trend. 

The main objective of the 1988 trend assessments of mathematics and 
science was to evaluate the differences between the 1986 and 1988 assessments. 
The 1988 trend point was to be added to the existing trend line. Since these 
analyses closely follow those conducted in 1986, readers desiring more 
detailed descriptions are referred to various chapters in the 1986 technical 






report (Beaton, 1988b), such as Chapter 8 by Mislevy for the underlying theory 
of measurement and the imputation of plausible values, Chapter 3 by Hansen, 
Rust, and Burke for sampling design, Chapter 14 by Johnson, Burke, Braden, 
Hansen, Lago, and Tepping for sample weights, Chapter 10 by Johnson for 
mathematics data analysis, and Chapter 11 by Yamamoto for science data 
analysis. This chapter will consider details specific to the 1988 analysis. 

7.1 Sampling of Students and Items for Mathematics and Science 

For ages 9 and 13, the combination of blocks, composition of item 
blocks, mode of administration, sample definitions, and time of testing were 
identical in 1986 and the 1988 bridge to 1986. Three booklets, identical to 
those used in 1986, were used to measure trend for these ages. Each booklet 
contained one reading, one mathematics, and one science block. Each student 
in the sample was administered one of these booklets. The mathematics and 
science portions were presented aurally using a tape recorder as in past 
assessments. The tape recorder was turned off for the reading block. 

For age 17, the mathematics and science booklets of the 1986 trend 
assessment were not used in 1988, since the 1986 mathematics and science trend 
booklets for age 17 did not include reading tasks. Instead, the booklets used 
in 1988 were identical to a subset of booklets used for the 1986 BIB 
assessment and consisted of six booklets, five of which contained at least one 
reading block and either a mathematics or a science trend block from the 1986 
assessment. The sixth booklet, which did not contain mathematics or science 
blocks, was included for the reading assessment in 1988. Only one of the two 
trend blocks of either mathematics or science was included in four of the 
booklets; the fifth booklet contained both a mathematics and a science block. 







The 1988 age 17 sample was defined using the same age definition as the 1986 
BIB assessment and received a print-administered assessment instead of the 
paced-tape administration of the pre-1988 trend assessments. Unlike the samples at 
ages 9 and 13, in which every student received both a mathematics and a 
science block, about one-fifth of the age 17 sample received both; the rest 
received a block of either mathematics or science items. 

The proficiencies of the three ages cannot be placed on a single scale 
without a cross-sectional study or a vertical equating across ages, neither of 
which was possible in the 1988 mathematics and science trend assessment. The 
mathematics and science scales were derived from the 1986 cross-sectional 
assessment (see E. G. Johnson, 1988d, and Yamamoto, 1988). The 1988 trend 
analysis added a new trend point to the existing trend line up to 1986. 

The specific mathematics and science samples for 1988 and 1986 follow. 

Sample                                 Age                              Modal 
(yr:age)  Type     Time     Mode       Definition         Sample       Size     Grade 

Mathematics and Science 
86:9a     Bridge   Winter   Tape       Calendar yr.       Age          6932     4 
88:9c     Bridge   Winter   Tape       Calendar yr.       Age          3711     4 
          Bridge   Fall     Tape       Calendar yr.       Age          6200     8 
          Bridge   Fall     Tape       Calendar yr.       Age          3942     8 

Mathematics 
86:17     Main     Spring   Print      Not calendar yr.   Age/grade    6151*    11 
86:17b    Bridge   Spring   Tape       Not calendar yr.   Age          3868     11 
88:17c    Bridge   Spring   Print      Not calendar yr.   Age/grade    1852     11 

Science 
86:17     Main     Spring   Print      Not calendar yr.   Age/grade    5611*    11 
86:17b    Bridge   Spring   Tape       Not calendar yr.   Age          3868     11 
88:17c    Bridge   Spring   Print      Not calendar yr.   Age/grade    1862     11 

* Number of age-only BIB sample students who answered any one of the trend blocks 






The items used for the analysis of the 1988 data set are the same as 
those used for the 1986 trend analyses; that is, the same items were excluded 
as in 1986 for reasons of lack of fit of the estimated item response function 
to the empirical regression curve. Three mathematics items, one from each age 
group, were excluded from scaling. Three science items were dropped from the 
scaling for age 9 and three from the scaling for age 17; one science item was 
dropped for age 13. 

Using current methods, it is possible to assess the change over time in 
either item characteristics or proficiencies of populations, but not both at 
the same time. This is true for any analysis, whether based on classical test 
theory, item response theory, or proportions correct. To assess change in 
item characteristics, we are forced to assume that the ability distribution of 
the population remains stable; to assess change in the ability distribution of 
the population, we must assume the stability of item characteristics (see the 
discussion of common-item equating in Chapter 6). However, we know that the 
latter assumption is not strictly true. Societal and instructional changes may 
produce gradual alterations in item functioning over time. If there is evidence 
that this is occurring, it may be desirable to allow for changes in the parameters 
of these common items. Permitting item characteristics to vary in this way is feasible 
only if common-population equating methods are available to link the newly 
obtained results to past trend lines. This is the approach that was used in 
analyzing the 1988 mathematics data at age 17 and science data at all three 
ages. 
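
The confounding described in this paragraph can be made concrete with a deliberately simplified sketch. It uses a one-parameter logistic (Rasch-type) response function, which is an assumption made here for brevity and not the scaling model used in the report; the abilities, difficulties, and shift are hypothetical. A population whose proficiency rises by some amount while the items stay fixed produces the same expected proportions correct as an unchanged population answering items that have all become easier by that amount, so response data alone cannot separate the two explanations.

```python
import math


def p_correct(theta, b):
    """One-parameter logistic response probability (simplifying assumption)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def expected_proportion_correct(thetas, difficulties):
    """Average probability of a correct response over all students and items."""
    total = sum(p_correct(t, b) for t in thetas for b in difficulties)
    return total / (len(thetas) * len(difficulties))


# Hypothetical abilities and item difficulties on an arbitrary scale.
thetas = [-1.0, -0.5, 0.0, 0.5, 1.0]
difficulties = [-0.8, 0.0, 0.7]
shift = 0.3

baseline = expected_proportion_correct(thetas, difficulties)

# Explanation 1: the population improved; the items are stable.
population_gain = expected_proportion_correct(
    [t + shift for t in thetas], difficulties)

# Explanation 2: the population is stable; every item became easier.
easier_items = expected_proportion_correct(
    thetas, [b - shift for b in difficulties])

print(baseline < population_gain)
print(math.isclose(population_gain, easier_items))
```

Breaking the tie requires outside information, which is the role played by the stability assumption in common-item equating and by the randomly equivalent samples in common-population equating.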






7.2 Scaling of the Mathematics Trend Data 
Ages 9 and 13 

From the item analysis, it was found that the 1988 response 
distributions of all response choices, including "omits," were quite similar 
to the 1986 data. The mean weighted proportion correct at the block level was 
computed; these values were compared with the 1986 results, as shown in Table 
7.1. At each block level for all age groups, the 1988 sample showed higher 
weighted proportion correct values than the 1986 sample. 
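
As a rough illustration of the statistic compared here, the sketch below computes a weighted proportion correct for each item and averages the items in a block. The scoring of responses as 1 or 0, the handling of omitted responses, and the use of a simple unweighted average across items are assumptions made for the example; the report does not spell out those details in this section, and the data shown are hypothetical.

```python
def weighted_proportion_correct(scores, weights):
    """Weighted share of correct responses to one item.

    scores  -- 1 for a correct response, 0 otherwise, one entry per student
    weights -- sampling weights for the same students, in the same order
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * s for s, w in zip(scores, weights)) / total_weight


def block_percent_correct(item_scores, weights):
    """Mean of the item-level weighted proportions, expressed as a percentage."""
    proportions = [weighted_proportion_correct(scores, weights)
                   for scores in item_scores.values()]
    return 100.0 * sum(proportions) / len(proportions)


# Hypothetical responses of four students to a two-item block.
weights = [2.0, 1.0, 1.0, 0.5]
block = {
    "item_1": [1, 0, 1, 1],
    "item_2": [0, 0, 1, 1],
}
print(round(block_percent_correct(block, weights), 1))
```

Students with larger sampling weights represent more of the population, so their responses count for more in each item-level proportion.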

In estimating item parameters in 1986, combined data from the three most 
recent trend assessments (1977, 1982, and 1986) were used. Thus, the 1986 
trend analysis assumed the characteristics of all items were stable across the 
three assessments. Item parameters estimated in 1986 were kept unchanged for
the 1988 assessment for ages 9 and 13. Consequently, the same constants were
used to transform provisional imputed values to the mathematics proficiency
scale.

To justify the use of the parameter estimates from the 1986 assessment, 
the fit of the IRT item parameters to the 1988 bridge data was examined by 
means of the chi-square test. At ages 9 and 13, the use of previously 
estimated item parameters appeared to be justified, but this was not the case 
at age 17. Hence, the item parameters applicable to age 9 and age 13 were 
kept unchanged for the mathematics trend analysis; they are presented in 
Tables D.8 and D.9 in Appendix D. 

The coexistence of item parameters that fit the data from a particular year
to varying degrees arises from the need to place several samples from
different years on a scale based upon common-item equating. When common-item
parameters are estimated on multiple data sets, the fit of the estimated item
regression curve to the weighted means of proportions correct, given an
ability level, is maximized. Because of this averaging, it is possible that
the estimated item parameters fit very well to the combined data sets as a
whole, but less well to each data set separately.

Table 7.1

Mathematics Weighted Mean Proportion Correct

                Block           1986   (N)            1988   (N)       Items[c]

Age 17          1               59.1   (2211)[a]      61.3   ( 619)    35
(paper)         2               63.4   (2233)[a]      65.7   ( 624)    35
                3               65.3   (2263)[a]      67.0   ( 609)    24 (19)
                Total           62.3   (6151)[a]      64.4   (1852)    94
                Noncalculator   61.0                  62.7             75

Age 17          1               60.3   (1934)[b]                       35
(taped)         2               62.1   (1934)[b]                       35
                3               64.5   (1934)[b]                       24 (19)
                Total           62.0   (3868)[b]                       94
                Noncalculator   60.8                                   75

Age 13          1               63.9   (2075)         65.3   (1405)    37
(taped)         2               58.5   (2054)         60.5   (1281)    37
                3               57.4   (2071)         60.0   (1256)    24
                Total           60.3   (6200)         62.2   (3942)    98
                Noncalculator   61.4                  63.2             82

Age 9           1               55.2   (2315)         58.2   (1274)    26
(taped)         2               57.3   (2361)         62.4   (1240)    26
                3               73.0   (2256)         76.7   (1197)    16
                Total           60.2   (6932)         64.2   (3711)    68
                Noncalculator   57.1                  62.1             57

[a] Age-only BIB sample with at least one mathematics trend block.
[b] 1986 Age 17 trend sample blocks 1 and 2 were paired.
[c] Includes some items that were excluded from IRT scaling;
    parentheses in this column indicate the number of calculator items excluded
    from IRT scaling.
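
The "Total" rows of Table 7.1 appear to be consistent with an average of the block-level values weighted by the number of items in each block. The Python sketch below checks this for the age 13 entries. The weighting rule is an inference from the tabled numbers, not a procedure stated in this report, and the function name is hypothetical.

```python
def item_weighted_total(block_pcts, block_items):
    """Average block-level percent-correct values, weighting each block by its item count."""
    total_items = sum(block_items)
    return sum(p * n for p, n in zip(block_pcts, block_items)) / total_items


# Age 13 (taped), blocks 1-3 from Table 7.1: percent correct and number of items.
items = [37, 37, 24]
total_1986 = item_weighted_total([63.9, 58.5, 57.4], items)
total_1988 = item_weighted_total([65.3, 60.5, 60.0], items)
print(round(total_1986, 1), round(total_1988, 1))
```

Both results land within rounding of the tabled age 13 totals (60.3 and 62.2), which is the only sense in which the weighting assumption is supported here.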

For Ages 9 and 13, the same common-item equating procedure that was 
employed in the 1986 trend analysis was used to align the 1988 point to the 
trend up to 1986. A brief description of the procedure follows. From the 
item parameters estimated in 1986 and background variables of 1988, the 
proficiency scores were imputed for the 1988 bridge data for each age using 
the M-GROUP computer program based on the plausible values methodology
(Sheehan, 1985; see Mislevy, 1988, for a detailed discussion). The
conditioning variables and the estimated conditioning effects for ages 9 and
13 are given in Tables D.10 and D.11 in Appendix D. The same linear constants
of 1986 were used to transform provisional imputed scores to the final 
proficiency scores of the mathematics trend. The transformation constants for 
all three ages are listed in Table 7.2. 

Table 7.2 

Coefficients of the Linear Transformation 
of the Trend Scale from Original Units 
to the Mathematics Proficiency Scale 

Age     Intercept     Slope

 9       218.42       35.84

13       266.58       34.57

17       303.25       31.84
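
As a minimal illustration of how the constants in Table 7.2 would be applied, the Python sketch below maps a provisional score onto the proficiency scale. It assumes the usual form, proficiency = intercept + slope x provisional value. The function and dictionary names are hypothetical, and the input scores are made up for illustration.

```python
# Table 7.2 constants: age -> (intercept, slope).
MATH_TREND_CONSTANTS = {
    9: (218.42, 35.84),
    13: (266.58, 34.57),
    17: (303.25, 31.84),
}


def to_proficiency_scale(provisional, age):
    """Linearly transform a provisional (original-unit) score to the proficiency scale."""
    intercept, slope = MATH_TREND_CONSTANTS[age]
    return intercept + slope * provisional


# A provisional score of 0 maps to the intercept; one unit adds one slope.
print(to_proficiency_scale(0.0, 9))
print(round(to_proficiency_scale(1.0, 13), 2))
```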

Age 17 

For age 17, new item parameters were estimated using the subsample from
the 1986 BIB assessment equivalent to the 1988 trend sample. Use of the
estimated item parameters from 1986 is not appropriate for the 1988 assessment
for age 17, because of the different mode of administration for the 1986 and
the 1988 trend assessments for that age. For example, on all five items of a
type referred to as "estimate" items, use of paper and pencil instead of a
tape recorder had a dramatic effect. "Estimate" items ask the student to
select an answer among several options, all of which are rounded so that none
of them is exactly correct. This property of the response options is indicated
by the word "about" being positioned before "how much" or "how many" in a
question. When an "estimate" item was presented under taped administration,
enough time was allowed for rough estimation of the (typically) large number,
but not enough time was allowed for the numerical calculation of the answer.
However, because under paper-and-pencil administration it is possible to spend
more time on an answer, the examinee may opt to perform the calculation rather
than the estimation. In such a case, it is more appropriate to treat an
"estimate" item as two different items under different modes of
administration. The observed item regression curves of the 1986 BIB data and
1986 bridge data of one of the "estimate" items are presented in Figure 7.1.

Therefore, for age 17, both equating methods, common-item (between the
1986 BIB and 1988 bridge samples) and common-population (between the 1986 BIB
and 1986 bridge samples), were used to place the 1988 trend sample on a scale
comparable to the 1986 reported scale. The procedure was as follows.
The item parameters for the total set of 73 items were estimated based on the
two data sets: the 1986 BIB assessment and the 1988 bridge to 1986. Both
samples included grade- and age-eligible students in order to maintain an
adequate sample size for estimation accuracy. This resulted in a second
set of item parameters for age 17. The new item parameters are listed in
Table D.12 in Appendix D; the old parameters appear in Beaton (1988b). The
rationale for estimating parameters for all items instead of only "estimate"
items comes from the main objective of the 1988 bridge to 1986, namely to
examine the possibility of effects due to changes in assessment procedures.

Figure 7.1

A Plot of Observed Proportion Correct of
the 1986 BIB Spiral and Trend Assessments with the Estimated
Item Regression Curve for an "Estimate" Item

From the above estimated item parameters and background information for
the appropriate sample, proficiency scores were imputed for each student in
the 1986 BIB and 1988 bridge-to-1986 samples. The conditioning variables and
the estimated conditioning effects for age 17 are given in Table D.13 in
Appendix D. Then the mean and standard deviation of the imputed scores of the
age-only subsample of the 1986 BIB were calculated. Constants were found to
match the means and standard deviations of the proficiency scores of the 1986 
trend sample and the age-only subsample of the 1986 BIB sample. Subsequently, 
by applying the same linear transformation to the provisional imputed values 
of the 1988 trend age-only sample, the 1988 trend point was aligned with the 
trend line up to 1986. The transformation constants for age 17 data are 
listed in Table 7.2. 
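
The mean-and-standard-deviation matching described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the scores are unweighted, a single value per student stands in for the imputed (plausible) values, and the reference scores and target mean and standard deviation are hypothetical placeholders rather than figures from the analysis.

```python
from statistics import mean, pstdev


def matching_constants(provisional, target_mean, target_sd):
    """Return (intercept, slope) mapping provisional scores onto a target mean and SD."""
    slope = target_sd / pstdev(provisional)
    intercept = target_mean - slope * mean(provisional)
    return intercept, slope


def transform(scores, intercept, slope):
    """Apply the linear transformation to a list of provisional scores."""
    return [intercept + slope * s for s in scores]


# Hypothetical provisional scores for the reference (common-population) sample.
reference = [-1.2, -0.4, 0.1, 0.6, 1.4]
intercept, slope = matching_constants(reference, target_mean=302.0, target_sd=31.0)
rescaled = transform(reference, intercept, slope)

# The same constants are then applied, unchanged, to a later sample.
later_sample = [-0.9, 0.0, 0.5, 1.1]
later_on_trend_scale = transform(later_sample, intercept, slope)
```

By construction the rescaled reference sample reproduces the target mean and standard deviation, so any shift in the later sample is expressed in units of the existing trend scale.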

The trends in mean proficiency with jackknifed standard errors for 
subpopulations of the three age samples are listed in Tables 7.3, 7.4, and 
7.5. The 1986 and 1988 posterior distributions of mathematics proficiency 
were calculated for each cohort separately at 40 quadrature points. Overlays 
of distributions from the two assessment years appear in Figures 7.2 through 
7.4. For age 17, the 1986 distribution is calculated on the 1986 bridge 
sample as well as on the age-only subsample of the 1986 BIB sample, which is 
comparable to the 1988 bridge sample of age 17. The shape of the 
distributions of the two assessments is quite similar for ages 9 and 13.







Table 7.3

Weighted Mathematics Proficiency Means
and Standard Errors for Age 9

                   1978            1982            1986            1988
Subgroup        Mean  S.E.      Mean  S.E.      Mean  S.E.      Mean  S.E.

Total          218.6 ( 0.8)*   219.0 ( 1.1)*   221.7 ( 1.0)*   229.0 ( 1.1)

Sex
  Male         217.4 ( 0.7)*   217.1 ( 1.2)*   221.7 ( 1.1)*   229.1 ( 1.6)
  Female       219.9 ( 1.0)*   220.8 ( 1.2)*   221.7 ( 1.2)*   229.0 ( 1.1)

Ethnicity
  White        224.1 ( 0.9)*   224.0 ( 1.1)*   226.9 ( 1.1)*   234.5 ( 1.2)
  Black        192.4 ( 1.1)*   194.9 ( 1.6)*   201.6 ( 1.6)    206.3 ( 2.6)
  Hispanic     202.9 ( 2.3)*   204.0 ( 1.3)*   205.4 ( 2.1)*   215.9 ( 3.4)
  Other        227.2 ( 3.2)*   238.5 ( 4.2)    221.8 ( 7.5)*   242.9 ( 4.2)

Grade
  < modal      190.9 ( 1.1)*   193.1 ( 1.4)*   198.1 ( 1.0)*   208.8 ( 1.8)
  = modal      228.5 ( 0.9)*   230.1 ( 1.0)*   233.8 ( 1.0)*   239.0 ( 1.2)
  > modal      240.5 ( 5.7)    258.3 (11.0)    248.8 (10.8)    260.1 ( 9.7)

Region
  Northeast    226.9 ( 1.9)    225.7 ( 1.7)    226.0 ( 2.7)    233.5 ( 3.1)
  Southeast    208.9 ( 1.2)*   210.4 ( 2.9)*   217.8 ( 2.5)    222.4 ( 2.9)
  Central      224.0 ( 1.5)*   221.1 ( 2.4)*   226.0 ( 2.3)*   233.9 ( 1.7)
  West         213.5 ( 1.4)*   219.3 ( 1.7)*   217.2 ( 2.4)*   226.9 ( 2.0)

* Shows statistically significant difference from 1988, where
α = .05 per set of three comparisons (each year compared to 1988).






Table 7.4

Weighted Mathematics Proficiency Means
and Standard Errors for Age 13

                   1978            1982            1986            1988
Subgroup        Mean  S.E.      Mean  S.E.      Mean  S.E.      Mean  S.E.

Total          264.1 ( 1.1)*   268.6 ( 1.1)*   269.0 ( 1.2)*   273.3 ( 0.8)

Sex
  Male         263.6 ( 1.3)*   269.2 ( 1.4)*   270.0 ( 1.1)*   275.3 ( 1.1)
  Female       264.7 ( 1.1)*   268.0 ( 1.1)    267.9 ( 1.5)    271.2 ( 1.0)

Ethnicity
  White        271.6 ( 0.9)*   274.4 ( 1.0)*   273.6 ( 1.3)*   279.1 ( 0.9)
  Black        229.6 ( 1.9)*   240.4 ( 1.6)*   249.2 ( 2.3)    250.3 ( 1.2)
  Hispanic     238.0 ( 2.2)*   252.4 ( 1.6)    254.3 ( 2.9)    254.7 ( 3.9)
  Other        272.5 ( 3.5)    274.5 ( 3.8)    282.7 ( 3.4)    288.6 ( 5.9)

Grade
  < modal      239.6 ( 1.4)*   247.2 ( 1.4)*   251.1 ( 1.1)*   255.8 ( 1.1)
  = modal      273.8 ( 1.1)*   276.6 ( 0.9)*   277.6 ( 1.0)*   282.5 ( 0.8)
  > modal      297.6 ( 7.7)    303.9 ( 7.6)    296.9 ( 7.7)    320.5 ( 7.2)

Region
  Northeast    272.7 ( 2.4)    276.9 ( 2.2)    276.6 ( 2.2)    278.7 ( 2.2)
  Southeast    252.7 ( 3.2)*   258.1 ( 2.4)*   263.5 ( 1.4)    268.2 ( 2.9)
  Central      269.4 ( 1.8)    272.8 ( 1.9)    266.1 ( 4.5)    271.3 ( 1.5)
  West         260.0 ( 1.9)*   266.0 ( 2.3)*   270.4 ( 2.1)    274.6 ( 1.7)

* Shows statistically significant difference from 1988, where
α = .05 per set of three comparisons (each year compared to 1988).






Table 7.5

Weighted Mathematics Proficiency Means
and Standard Errors for Age 17

                  1978           1982           1986         1986 (BIB)       1988
Subgroup        Mean S.E.      Mean S.E.      Mean S.E.      Mean S.E.      Mean S.E.

Total          300.4 (0.9)*   298.5 (0.9)*   302.0 (0.9)    302.0 (0.8)    305.4 (1.2)

Sex
  Male         303.8 (1.0)    301.5 (1.0)*   304.7 (1.2)    303.2 (1.0)    306.7 (1.8)
  Female       297.1 (1.0)*   295.6 (1.0)*   299.4 (1.0)*   300.8 (0.8)*   304.2 (1.4)

Ethnicity
  White        305.9 (0.9)    303.7 (0.9)*   307.5 (1.0)    306.4 (0.9)    309.5 (1.4)
  Black        268.4 (1.3)*   271.8 (1.3)*   278.6 (2.1)*   280.0 (1.0)*   289.2 (2.1)
  Hispanic     276.3 (2.2)*   276.7 (2.0)*   283.1 (2.9)*   286.0 (1.8)*   294.3 (3.5)
  Other        312.9 (3.4)    309.4 (8.8)    304.7 (7.2)    314.6 (6.0)    314.2 (7.0)

Grade
  < modal      272.7 (1.1)*   274.1 (1.5)*   277.3 (1.6)*   278.9 (1.4)*   283.4 (1.8)
  = modal      304.7 (1.0)*   302.5 (0.9)*   306.7 (0.9)*   307.7 (0.7)*   312.5 (1.1)
  > modal      309.3 (1.0)    306.5 (1.4)    309.1 (3.0)    312.3 (2.0)    314.2 (3.0)

Region
  Northeast    306.7 (1.7)    304.0 (2.1)    307.4 (1.9)    308.4 (1.4)    309.5 (2.6)
  Southeast    292.3 (1.7)*   292.3 (2.1)*   297.3 (1.4)    295.0 (1.2)    300.2 (2.3)
  Central      305.2 (1.8)    302.0 (1.1)    303.6 (1.9)    303.7 (1.6)    305.5 (2.8)
  West         295.5 (1.8)*   294.1 (2.0)*   299.3 (2.7)    299.9 (1.7)    306.4 (2.2)

* Shows statistically significant difference from 1988, where α = .05
per set of four comparisons (each year compared to 1988).






Figure 7.2

Estimated Mathematics Proficiency Distributions
for the 1986 and 1988 Bridge Samples, Age 9

[Horizontal axis: proficiency scale]

Figure 7.3

Estimated Mathematics Proficiency Distributions
for the 1986 and 1988 Bridge Samples, Age 13

[Horizontal axis: proficiency scale]

Figure 7.4

Estimated Mathematics Proficiency Distributions
for the 1986 Age-only BIB Sample and the 1988 Age-only Bridge Sample,
Age 17



In 1986, using the range of student performance on the NAEP mathematics
scale, five levels of mathematics proficiency were established and described
in detail in the Mathematics Report Card (Dossey, Mullis, Lindquist, &
Chambers, 1988): Level 150--Simple Arithmetic Facts, Level 200--Beginning
Skills and Understanding, Level 250--Basic Operations and Beginning Problem
Solving, Level 300--Moderately Complex Procedures and Reasoning, and Level
350--Multi-step Problem Solving and Algebra. Table 7.6 shows the percentage
of students at ages 9, 13, and 17 who attained each level of proficiency in
the 1978, 1982, 1986, and 1988 assessments.

Table 7.6

Mathematics Trends for 9-, 13-, and 17-Year-Old Students:
Percentage of Students at or Above
the Five Proficiency Levels, 1978-1988

                                                  Assessment Year
                                 1978           1982           1986           1988
Proficiency Levels        Age    Mean S.E.      Mean S.E.      Mean S.E.      Mean S.E.

Level 150                   9    96.5 (0.2)     97.2 (0.3)     97.8 (0.2)     99.0 (0.2)
Simple Arithmetic          13    99.8 (0.0)     99.9 (0.0)    100.0 (0.0)    100.0 (0.0)
Facts                      17   100.0 (0.0)    100.0 (0.0)    100.0 (0.0)    100.0 (0.0)

Level 200                   9    70.3 (0.9)*    71.5 (1.1)*    73.9 (1.1)*    81.5 (1.1)
Beginning Skills           13    94.5 (0.4)     97.8 (0.4)     98.5 (0.2)     98.6 (0.3)
and Understandings         17    99.8 (0.0)     99.9 (0.1)     99.9 (0.1)    100.0 (0.0)

Level 250                   9    19.4 (0.6)*    18.7 (0.8)*    20.8 (0.9)*    27.0 (1.3)
Basic Operations and       13    64.9 (1.2)*    71.6 (1.2)*    73.1 (1.5)     76.8 (0.9)
Beginning Problem Solving  17    92.1 (0.5)     92.9 (0.5)     96.0 (0.4)     97.5 (0.4)

Level 300                   9     0.0 (0.1)      0.6 (0.1)      0.6 (0.2)      1.4 (0.3)
Moderately Complex         13    17.9 (0.7)     17.8 (0.9)     15.9 (1.0)*    20.5 (0.9)
Procedures and Reasoning   17    51.4 (1.1)*    48.3 (1.2)*    51.1 (1.2)*    58.7 (1.5)

Level 350                   9     0.0 (0.0)      0.0 (0.0)      0.0 (0.0)      0.0 (0.0)
Multi-step Problem         13     0.9 (0.2)      0.5 (0.1)      0.4 (0.1)      0.7 (0.2)
Solving and Algebra        17     7.4 (0.4)      5.4 (0.4)      6.4 (0.4)      6.5 (1.0)

* Shows statistically significant difference from 1988, where α = .05
per set of three comparisons (each year compared to 1988). (No significance
test is reported when the proportion of students is either >95.0 or <5.0.)
Standard errors are presented in parentheses.

7.3 Scaling of the Science Trend Data

The 1988 science trend analysis followed procedures and methods similar
to those for the mathematics analysis. From the item analysis, it was found
that the 1988 response distributions of all response choices, including
"omits," were quite similar to the 1986 data. The mean weighted proportion
correct at the block level was computed; these values were compared with the
1986 results, and are presented in Table 7.7.

In 1986, item parameters were estimated for the age 9, 13, and 17
samples. The trend items for age 13 and age 17 were estimated together
because the majority of the items were common to both ages. For the 1988
data, because of the change in the mode of administration for age 17, those
items had to be estimated separately from the age 13 items. To obtain the
best estimates of proficiencies for the two years, items for age 13 were
reestimated using BILOG (Mislevy & Bock, 1983) on the 1986 and 1988 bridge
data sets. For age 9, it was found that the 1986 score key for one of 63
items did not distinguish "I don't know"; hence the responses to that item
were treated as wrong when they should have been treated as "omit." This
error was found only in the 1986 bridge data set for age 9. The consequence
of this error on the proficiency score is very small for two reasons: It
involved only 8 percent of the responses for a particular item, and the
subjects who selected the "I don't know" option had the lowest mean proportion
correct among all options. In fact, using the trend item parameters from 1986
estimated on the incorrect data sets, we compared the means of the ability
distributions of two data sets with and without correction of the 1986 age 9
trend and found that they differed by about .07 on the proficiency scale. In
order to assess administration effect as accurately as possible, however, the
item parameters for all items were estimated for age 9 based on the 1986 and
1988 corrected bridge data sets. The estimated item parameters for the three
ages are listed in Tables D.14, D.15, and D.16 in Appendix D.

Table 7.7

Science Weighted Mean Proportion Correct

                Block     1986   (N)            1988   (N)       Items[c]

Age 17          1         60.5   (2223)[a]      60.6   ( 634)    27
                2         59.0   (?)            60.7   (?)       32
                3         53.7   (2282)[a]      56.3   ( 609)    23
                Total     58.0   (5611)[a]      59.5   (1862)    82

Age 17          1         63.3   (1934)[b]                       27
(taped)         2         63.4   (1934)[b]                       32
                3         58.9   (1934)[b]                       23
                Total     62.1   (3868)[b]                       82

Age 13          1         52.5   (2075)         53.8   (1405)    25
(taped)         2         54.2   (2054)         54.7   (1281)    31
                3         56.2   (2071)         57.8   (1256)    27
                Total     54.3   (6200)         55.5   (3942)    83

Age 9           1         59.4   (2315)         62.6   (1274)    18
(taped)         2         52.5   (2361)         53.5   (1240)    25
                3         68.5   (2256)         69.0   (1197)    20
                Total     59.5   (6932)         61.0   (3711)    63

[a] Age-only BIB sample with at least one science trend block.
[b] 1986 age 17 trend sample blocks 1 and 2 were paired.
[c] Includes some items that were excluded from IRT scaling.
(?) marks a sample size that is illegible in the source copy.

The imputed proficiency values of the 1988 sample were calculated from
the responses on cognitive items and background questions based on the item
parameters estimated on the trend samples of 1986 and 1988. At this point,
the imputed values of the 1988 sample were not comparable to the trend scale
of 1986. Note that the 1986 sample was used to obtain two separate sets of
trend item parameters, one for the data up to and including 1986 and the
other for the data from 1986 and 1988. This design enabled us to use
common-population equating based on the same sample, and also to express the
difference in the distribution of proficiency between 1986 and 1988 in terms
of the trend scale established in 1986. The linear transformations were
derived separately for ages 9 and 13 to match, within each age cohort, the two
means and standard deviations of proficiencies of the 1986 bridge sample, one
based on the item parameters estimated on the data until 1986 and the other
based on the item parameters estimated on the 1986 and 1988 data. The linear
constants derived from those transformations were applied to the 1988 data set
to obtain trend points for 1988. For age 17, we applied an equating method
identical to that used for age 17 mathematics data. The conditioning
variables and the estimated conditioning effects are given in Tables D.17 to
D.19 (Appendix D) for all three ages. The linear coefficients used for the
three ages are presented in Table 7.8.

Table 7.8 

Coefficients of the Linear Transformation 
of the Trend Scale from Original Units 
to the Science Proficiency Scale 

Age Intercept Slope 

9 225.59 41.15 

13 254.19 36.92 

17 289.34 43.05 

The trends in mean proficiency with jackknifed standard errors for
subpopulations of the three age samples are listed in Tables 7.9 - 7.11. The
1986 and 1988 posterior distributions of science proficiency were calculated
for each cohort separately at 40 quadrature points. Overlays of distributions
from the two assessment years appear in Figures 7.5 through 7.7. For age 17,
the 1986 distribution is calculated on the 1986 bridge sample as well as on
the age-only subsample of the 1986 BIB sample, which is comparable to the 1988 
bridge sample of age 17. The shape of the distributions of the two assessment 
years is quite similar for each cohort. 

Table 7.9

Weighted Science Proficiency Means
and Standard Errors, Age 9

                   1977            1982            1986            1988
Subgroup        Mean  S.E.      Mean  S.E.      Mean  S.E.      Mean  S.E.

Total          219.9 ( 1.2)*   220.9 ( 1.8)*   224.3 ( 1.2)*   228.9 ( 1.3)

Sex
  Male         222.1 ( 1.3)*   221.0 ( 2.3)*   227.3 ( 1.4)    232.1 ( 1.6)
  Female       217.7 ( 1.2)*   220.7 ( 2.0)    221.3 ( 1.4)    225.7 ( 1.6)

Ethnicity
  White        229.6 ( 0.9)*   229.1 ( 1.9)*   231.9 ( 1.2)*   237.4 ( 1.3)
  Black        174.9 ( 1.9)*   187.1 ( 3.0)*   196.2 ( 1.9)    200.1 ( 2.5)
  Hispanic     191.9 ( 2.9)    189.0 ( 4.1)    199.4 ( 3.1)    201.0 ( 6.1)
  Other        214.4 ( 7.5)    222.8 ( 5.4)    220.6 ( 4.6)    229.7 ( 3.6)

Grade
  < modal      197.6 ( 1.6)*   197.5 ( 2.9)*   204.9 ( 1.6)*   213.8 ( 2.3)
  = modal      227.0 ( 1.2)*   230.7 ( 2.2)    234.3 ( 1.2)    236.2 ( 1.4)
  > modal      243.9 ( 6.1)    265.9 (15.1)    235.0 (10.7)*   278.7 (13.3)

Region
  Northeast    224.5 ( 1.6)    221.8 ( 2.7)    228.2 ( 3.5)    228.9 ( 3.8)
  Southeast    205.1 ( 3.0)*   214.0 ( 3.9)    218.8 ( 3.1)    223.7 ( 2.5)
  Central      225.6 ( 2.2)*   226.3 ( 3.4)    227.9 ( 2.2)*   236.7 ( 2.9)
  West         220.9 ( 2.3)    219.9 ( 4.1)    222.1 ( 3.2)    226.8 ( 2.0)

* Shows statistically significant difference from 1988, where
α = .05 per set of three comparisons (each year compared to 1988).

Table 7.10

Weighted Science Proficiency Means
and Standard Errors, Age 13

                   1977            1982            1986            1988
Subgroup        Mean  S.E.      Mean  S.E.      Mean  S.E.      Mean  S.E.

Total          247.4 ( 1.1)*   250.2 ( 1.3)*   251.4 ( 1.4)*   257.3 ( 0.9)

Sex
  Male         251.1 ( 1.3)*   255.7 ( 1.5)*   256.1 ( 1.6)*   262.2 ( 1.2)
  Female       243.8 ( 1.2)*   245.0 ( 1.3)*   246.9 ( 1.5)*   252.4 ( 1.0)

Ethnicity
  White        256.1 ( 0.8)*   257.3 ( 1.1)*   259.2 ( 1.4)*   265.2 ( 0.9)
  Black        208.1 ( 2.4)*   217.2 ( 1.3)*   221.6 ( 2.5)*   229.4 ( 1.2)
  Hispanic     213.4 ( 2.2)*   225.5 ( 3.9)    226.1 ( 3.1)    229.3 ( 4.2)
  Other        235.1 ( 3.4)*   262.4 (11.5)    253.0 ( 4.0)    265.7 ( 5.2)

Grade
  < modal      223.4 ( 1.6)*   228.6 ( 1.6)*   234.2 ( 1.9)*   242.2 ( 1.8)
  = modal      2?6.0 ( 1.0)*   258.5 ( 1.3)*   259.8 ( 1.3)*   265.3 ( 0.7)
  > modal      284.7 ( 4.9)    287.4 ( 8.0)    266.4 ( 6.3)*   304.2 ( 9.6)

Region
  Northeast    255.3 ( 2.4)      ?   ( 2.4)    257.6 ( 3.1)    259.9 ( 2.8)
  Southeast    235.1 ( 1.8)*     ?   ( 2.4)*   247.1 ( 2.2)    253.1 ( 3.0)
  Central      253.8 ( 1.8)      ?   ( 2.4)    249.4 ( 5.3)    259.2 ( 1.6)
  West         243.0 ( 2.3)*     ?   ( 3.0)    252.3 ( 2.7)    256.9 ( 1.7)

* Shows statistically significant difference from 1988, where
α = .05 per set of three comparisons (each year compared to 1988).
? marks a digit or value that is illegible in the source copy.

Table 7.11

Weighted Science Proficiency Means
and Standard Errors, Age 17

                   1977            1982            1986         1986 (BIB)         1988
Subgroup        Mean  S.E.      Mean  S.E.      Mean  S.E.      Mean  S.E.      Mean  S.E.

Total          289.6 ( 1.0)*   283.3 ( 1.1)*   288.5 ( 1.4)*   288.5 ( 1.1)*   294.2 ( 1.5)

Sex
  Male         297.1 ( 1.2)    291.9 ( 1.4)*   294.9 ( 1.9)*   294.6 ( 1.4)*   302.5 ( 2.3)
  Female       282.3 ( 1.1)    275.2 ( 1.3)*   282.3 ( 1.5)    282.3 ( 1.2)    285.6 ( 1.9)

Ethnicity
  White        297.7 ( 0.7)    293.2 ( 1.0)*   297.5 ( 1.7)    296.1 ( 1.2)*   301.9 ( 1.7)
  Black        240.3 ( 1.5)*   234.8 ( 1.7)*   252.8 ( 2.9)    254.9 ( 1.9)    260.0 ( 3.4)
  Hispanic     262.3 ( 2.5)*   248.7 ( 2.4)*   259.3 ( 3.8)*   258.6 ( 2.0)*   281.8 ( 5.2)
  Other        284.4 ( 4.1)    269.1 ( 4.9)    276.8 (11.2)    288.9 ( 9.9)    295.9 (11.6)

Grade
  < modal      253.2 ( 1.4)*   250.8 ( 2.2)*   259.2 ( 2.7)    256.7 ( 2.2)*   266.3 ( 2.9)
  = modal      295.0 ( 0.9)*   288.9 ( 1.1)*   294.0 ( 1.6)*   296.7 ( 1.2)    300.6 ( 1.5)
  > modal      300.8 ( 1.5)*   292.6 ( 2.6)*   298.6 ( 4.3)*   297.0 ( 2.8)*   317.0 ( 4.2)

Region
  Northeast    296.4 ( 2.3)    284.4 ( 1.9)*   292.2 ( 4.3)    296.6 ( 2.0)    303.3 ( 4.5)
  Southeast    276.4 ( 1.9)*   276.3 ( 2.8)*   283.5 ( 2.0)    279.6 ( 2.1)*   288.0 ( 3.1)
  Central      294.1 ( 1.6)    289.3 ( 2.4)    294.4 ( 2.3)    291.1 ( 2.3)    292.0 ( 3.1)
  West         286.6 ( 1.6)    280.9 ( 2.7)*   283.2 ( 3.8)    285.5 ( 2.3)*   294.2 ( 3.2)

* Shows statistically significant difference from 1988, where α = .05 per
set of four comparisons (each year compared to 1988).

Figure 7.5

Estimated Science Proficiency Distributions
for the 1986 and 1988 Bridge Samples, Age 9

[Horizontal axis: proficiency scale]

Figure 7.6

Estimated Science Proficiency Distributions
for the 1986 and 1988 Bridge Samples, Age 13

Figure 7.7

Estimated Science Proficiency Distributions
for the 1986 Age-only BIB Sample and the 1988 Age-only Bridge Sample,
Age 17

[Horizontal axis: proficiency scale]

In 1986, using the range of student performance on the NAEP science
scale, five levels of science proficiency were established and described in
detail in the Science Report Card (Mullis & Jenkins, 1988): Level 150--Knows
Everyday Science Facts, Level 200--Understands Simple Scientific Principles,
Level 250--Applies Basic Scientific Information, Level 300--Analyzes
Scientific Procedures and Data, and Level 350--Integrates Specialized
Scientific Information. Table 7.12 shows the percentage of students at ages
9, 13, and 17 who attained each level of proficiency in the 1977, 1982, 1986,
and 1988 assessments.

7.4 Major Findings for Mathematics and Science Trend Data 

The four main findings of the comparison of the 1988 and the 1986 trend
samples for mathematics and science are as follows:

1) For all three ages, the 1988 trend sample showed higher weighted
   mean proportions correct than the corresponding 1986 trend sample.
   This was true at the block level as well as for overall
   performance in mathematics and science.

2) In terms of proficiency scale means of mathematics and science for
   the entire sample, the 1988 sample's performance was superior to
   the comparable 1986 sample's performance. The improvements were
   statistically significant for all samples except the age 17
   mathematics sample (see Tables 7.3 - 7.5 and 7.9 - 7.11); note
   that the desired Type I error rate was divided by the number of
   contrasts using a Bonferroni approach. This was true for most of
   the subpopulation levels as well. The means and standard
   deviations for mathematics and science for all three ages since
   1969 are presented in Table 7.13. They are plotted in Figures 7.8
   and 7.9.

3) The differences between paced administration and paper-and-pencil
   administration for age 17 in 1986 were not statistically
   significant for mathematics and science at any reporting
   subpopulation levels (e.g., gender, race/ethnicity, grade, or
   region).

4) Trends of mean proficiencies in mathematics and science closely
   parallel each other. Strictly speaking, any direct comparison of
   the value of the proficiency means in different subject areas and
   the changes in proficiency over time across subject areas has
   limited meaning. However, the shape and relative magnitudes can
   be compared across subject areas. The apparent large increases in
   mathematics and science proficiencies from 1986 to 1988 exist even
   though there was no context change in regard to the item order,
   composition, and mode and timing of presentation.
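
The Bonferroni adjustment mentioned in the second finding can be illustrated with a short Python sketch. It is a simplified stand-in for the report's significance tests: it assumes the two samples are independent and uses a normal reference distribution, neither of which is stated in the text, and the function name is hypothetical. The means and standard errors are the "Total" rows of Tables 7.3 and 7.5.

```python
from math import sqrt
from statistics import NormalDist


def differs_from_1988(mean_then, se_then, mean_1988, se_1988, n_comparisons, alpha=0.05):
    """Two-sided z test of a difference in means with a Bonferroni-adjusted alpha.

    Assumes independent samples and a normal reference distribution; both are
    simplifying assumptions, not details taken from the report.
    """
    z = (mean_1988 - mean_then) / sqrt(se_then ** 2 + se_1988 ** 2)
    # Split alpha across the comparisons, then across the two tails.
    critical = NormalDist().inv_cdf(1 - alpha / n_comparisons / 2)
    return abs(z) > critical, z, critical


# Age 9 mathematics, 1986 vs. 1988 (Table 7.3), three comparisons per set.
age9 = differs_from_1988(221.7, 1.0, 229.0, 1.1, n_comparisons=3)
# Age 17 mathematics, 1986 vs. 1988 (Table 7.5), four comparisons per set.
age17 = differs_from_1988(302.0, 0.9, 305.4, 1.2, n_comparisons=4)
print(age9[0], age17[0])
```

Under these assumptions the sketch flags the age 9 change as significant and the age 17 change as not significant, which agrees with the asterisks in the two tables.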






Table 7.12

Science Trends for 9-, 13-, and 17-Year-Old Students:
Percentage of Students at or Above
the Five Proficiency Levels, 1977-1988

                                                  Assessment Year
                                 1977           1982           1986           1988
Proficiency Levels        Age    Mean S.E.      Mean S.E.      Mean S.E.      Mean S.E.

Level 150                   9    93.6 (0.5)     95.0 (0.5)     96.3 (0.3)     97.3 (0.3)
Knows Everyday             13    98.6 (0.1)     99.6 (0.1)     99.8 (0.1)     99.7 (0.1)
Science Facts              17    99.8 (0.0)     99.7 (0.1)     99.9 (0.1)    100.0 (0.0)

Level 200                   9    67.9 (1.1)*    70.4 (1.6)*    71.4 (1.0)*    76.4 (0.9)
Understands Simple         13    85.9 (0.7)*    89.6 (0.7)*    91.8 (0.9)     93.4 (0.6)
Scientific Principles      17    97.2 (0.2)     95.7 (0.4)     96.7 (0.4)     98.9 (0.4)

Level 250                   9    26.2 (0.7)*    24.8 (1.7)*    27.6 (1.0)     31.2 (1.4)
Applies Basic              13    49.2 (1.1)*    51.5 (1.4)*    53.4 (1.4)*    59.0 (0.8)
Scientific Information     17    81.1 (0.7)*    76.8 (1.0)*    80.8 (1.2)*    80.4 (0.9)

Level 300                   9     3.5 (0.2)      2.2 (0.6)      3.4 (0.4)      3.9 (0.5)
Analyzes Scientific        13    10.9 (0.4)      9.4 (0.6)*     9.4 (0.7)*    12.4 (0.7)
Procedures and Data        17    41.7 (0.8)     37.5 (0.8)*    41.4 (1.4)     44.6 (1.9)

Level 350                   9     0.0 (0.0)      0.1 (0.1)      0.1 (0.1)      0.2 (0.1)
Integrates Specialized     13     0.7 (0.1)      0.4 (0.1)      0.2 (0.1)      0.5 (0.1)
Scientific Information     17     8.5 (0.4)      7.2 (0.4)      7.5 (0.6)      8.2 (1.0)

* Shows statistically significant difference from 1988, where α = .05
per set of three comparisons (each year compared to 1988). (No significance
test is reported when the proportion of students is either >95.0 or <5.0.)
Standard errors are presented in parentheses.






Table 7.13

Trend of Proficiency Scale Means and Standard Deviations
for Mathematics and Science

              1970    1973    1977            1978            1982            1986            1988
              Mean    Mean    Mean  S.D.      Mean  S.D.      Mean  S.D.      Mean  S.D.      Mean  S.D.

Age 17 Math           304.4                   300.4 (?)       298.5 (??.39)   302.0 (31.09)   305.4 (29.74)
Age 17 Sci    304.8   295.8   289.6 (44.58)                   283.3 (46.67)   288.5 (44.48)   294.2 (41.37)

Age 13 Math           266.0                   264.1 (38.99)   268.6 (33.36)   269.0 (30.84)   273.3 (31.74)
Age 13 Sci    254.9   249.5   247.4 (43.11)                   250.2 (38.65)   251.4 (36.63)   257.3 (37.20)

Age 9 Math            219.1                   218.6 (36.02)   219.0 (34.80)   221.7 (33.98)   229.0 (?)
Age 9 Sci     224.9   220.3   219.9 (44.88)                   220.9 (40.93)   224.3 (?)       228.9 (?)

Note: 1970-1973 results are interpolated backward; standard deviations of
proficiency are in parentheses. ? marks a standard deviation (or digit) that
is illegible in the source copy.






Figure 7.8

Trend of Proficiency Scale Means for Mathematics
1973 - 1988*

[Horizontal axis: year of administration]

* 1973 results were interpolated for this plot. Bands extend from two standard
errors below to two standard errors above the mean.

Mathematics Scale Means and Standard Errors

Year     Age 9 (S.E.)     Age 13 (S.E.)    Age 17 (S.E.)

1973     219.1*           266.0*           304.4*
1978     218.6 (0.8)      264.1 (1.1)      300.4 (0.9)
1982     219.0 (1.1)      268.6 (1.1)      298.5 (0.9)
1986     221.7 (1.0)      269.0 (1.2)      302.0 (0.9)
1988     229.0 (1.1)      273.3 (0.8)      305.4 (1.2)






Figure 7.9

Trend of Proficiency Scale Means for Science
1969 - 1988*

[Horizontal axis: year of administration]

* 1969, 1970, and 1973 results were interpolated for this plot. Bands extend from
two standard errors below to two standard errors above the mean.

Science Scale Means and Standard Errors

Year     Age 9 (S.E.)     Age 13 (S.E.)    Age 17 (S.E.)

1969     --               --               304.8*
1970     224.9*           254.9*           --
1973     220.3*           249.5*           295.8*
1977     219.9 (1.2)      247.4 (1.1)      289.6 (1.0)
1982     220.9 (1.8)      250.2 (1.3)      283.3 (1.1)
1986     224.3 (1.2)      251.4 (1.4)      288.5 (1.4)
1988     228.9 (1.3)      257.3 (0.9)      294.2 (1.5)






Chapter 8 
ITEM-BY-FORM VARIATION 
IN 1984 AND 1986 NAEP READING SURVEYS¹



Robert J. Mislevy 



8.1 Introduction

The 1984 and 1986 NAEP reading surveys employed overlapping sets of test 
items, but administered those items in forms that differed in length, 
composition, timing, and administration conditions. As discussed elsewhere in
this report, it has been hypothesized that the main effects of such changes
were responsible to some degree for the anomalous results observed in 1986; 
that is, the cumulative effect of such changes caused the assessment in a 
particular age/grade to become easier or harder, leading to the large, and 
frankly, unbelievable, differences initially observed between the 1984 and 
1986 percent-correct results. This chapter investigates the magnitudes of 
item-by-form variation, above and beyond main effects. 

While the primary investigations focus upon main effects, this ancillary 
study capitalizes upon the bridge data to highlight a key issue in instrument 
design. To anticipate, we find that modifying assessment forms can cause the 
accuracy of measures of change to plummet, and that similar magnitudes of this 
extraneous variation were found in the 1984-86 period and historical NAEP
reading assessments. The results support maintaining absolutely identical
portions of instruments between successive time points to measure change,
while introducing innovations in other portions.

¹ I am grateful to Nancy Allen, Al Beaton, Gene Johnson, John Tukey, and
Rebecca Zwick for discussions and comments on the analyses described here.

What follows is an analysis of the components of variance of item 
percents-correct, which were the basis of NAEP trend reports prepared under 
the aegis of the Education Commission of the States. These procedures 
illuminate, without the complexities of the scale-score methodology, the same 
sources of variation that affect NAEP scale-score reports prepared under the 
aegis of Educational Testing Service. In the spirit of generalizability 
theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972), estimates of variance 
components for items-by-test-forms, items-by-time, and student-sampling shed
light upon the sources of uncertainty in NAEP estimates of change, their 
relative magnitudes, and their likely effects under alternative assessment 
designs. 

8.2 Background

The item percents-correct from initial analyses of the 1986 NAEP reading 
survey indicated sudden declines in average proficiency for 9- and
17-year-olds in the two-year period since 1984 that exceeded the largest changes ever
seen in NAEP's history of comparisons over four-, five-, and six-year periods. 
A number of analyses were carried out with the 1986 data to check hypotheses 
about mechanisms that could have led to spurious declines of this magnitude 
(Beaton, 1988a). No proximate cause was identified in those analyses. 

Other hypotheses could not be checked with those data, however, 
including the possibilities of effects due to the rearrangements of items on 
test forms, changes in administration procedures, changes in time allocations,
and response modes. Changing the response mode from circling a correct answer
to filling in a bubble, for example, or tightening time limits, could tend to
make all items in the assessment more difficult. Changing an item's position
from the beginning to the end of a block, changing the print size so that
hyphenations appeared in different words, or moving the item from the context
of easier to that of harder items, could tend to make it appear more difficult 
relative to other items. 

To investigate the possibility of cumulative effects of these types, an 
experiment was embedded in the 1988 NAEP reading survey. At each age/grade,
randomly equivalent samples of students were administered representative
booklets from the 1984 survey and the 1986 survey. Each "bridge sample"
survey was carried out with timings and administration conditions that matched
the actual 1984 or 1986 conditions as closely as practical. The average
difference over items between the two bridge samples estimates the main effect
of changes in forms and administration procedures, as discussed in Chapter 5.
The item-by-form variation between the bridge samples quantifies the magnitude
of changes in relative item difficulties, above and beyond the main effect. 
These latter variations are the subject of the present chapter. 



8.3 The Data 

Data were obtained at each age, in the form of percents of correct 
responses to all the multiple-choice items that appeared in both the 1988
bridges; that is, bridges to the 1984 and 1986 assessments. There were 26 
such items at age 9, 19 at age 13, and 23 at age 17. This collection is a 
subset of all the items that appeared in both the full 1984 and full 1986 
instruments. The single professionally scored, open-ended item common to the
bridges was not included due to possible scoring inconsistencies. The average
number of students responding to the items in each Age, Form, Time-Point
combination is shown in Table 8.1. The key features of the four samples at
each age are as follows: 

Sample Name    Test Form    Student Population

1984           1984 form    1984 students
1986           1986 form    1986 students
1984b          1984 form    1988 students
1986b          1986 form    1988 students

The item percents-correct, or "item p's," are shown in Tables 8.2
through 8.4. Because the data were obtained under complex sample designs,
student responses were weighted in accordance with the students' selection
probabilities. (The revised poststratification weights described in Appendix
C have been used here with the age 9 and age 13 data, thereby eliminating one 
factor that contributed to the anomalies originally seen in 1986.) 

8.4 A Model for Item-p's

The variance-components analysis is based on a linear model for item-p's:

$$p_{ift} = \mu_i + \alpha_{if} + \beta_{it} + \epsilon_{ift}$$

where

$p_{ift}$ is the item-p for Item i on Form f from the sample of students at
Time-point t.

$\mu_i$ is the population average for Item i over forms and time points.






Table 8.1

Average Sample Sizes for Item Percents-Correct

                          Assessment Year
Test Form    Age    1984             1986             1988

                    [1984 sample]                     [1984b sample]
             9      1972             -                598
1984         13     2208             -                657
             17     2381             -                604

                                     [1986 sample]    [1986b sample]
             9      -                2102             1135
1986         13     -                1911             1280
             17     -                1901             867






Table 8.2

NAEP Bridge Study Weighted Percents-correct, Age 9

Item         1984    1986    1984b   1986b   86-84     84b-84    86b-86    86b-84b

N001501      0.792   0.765   0.833   0.825   -0.027    0.041     0.060     -0.008
N001502      0.517   0.518   0.544   0.555   0.001     0.027     0.037     0.011
N001503      0.671   0.659   0.704   0.691   -0.012    0.033     0.032     -0.013
N001504      0.587   0.586   0.609   0.615   -0.001    0.022     0.029     0.006
N002001      0.293   0.342   0.299   0.362   0.049     0.006     0.020     0.063
N002002      0.362   0.377   0.391   0.397   0.015     0.029     0.020     0.006
N002003      0.405   0.423   0.453   0.433   0.018     0.048     0.010     -0.020
N002801      0.536   0.505   0.580   0.549   -0.031    0.044     0.044     -0.031
N002802      0.549   0.572   0.600   0.609   0.023     0.051     0.037     0.009
N003101      0.534   0.528   0.611   0.598   -0.006    0.077     0.070     -0.013
N003102      0.347   0.376   0.330   0.429   0.029     -0.017    0.053     0.099
N004101      0.664   0.597   0.691   0.619   -0.067    0.027     0.022     -0.072
N008601      0.654   0.618   0.675   0.653   -0.036    0.021     0.035     -0.022
N008602      0.562   0.540   0.574   0.560   -0.022    0.012     0.020     -0.014
N008603      0.625   0.589   0.667   0.631   -0.036    0.042     0.042     -0.036
[illegible]  0.678   0.709   0.695   0.754   0.031     0.017     0.045     0.059
N008902      0.697   0.713   0.684   0.767   0.016     -0.013    0.054     0.083
N009401      0.731   0.732   0.759   0.760   0.001     0.028     0.028     0.001
N009801      0.918   0.913   0.895   0.899   -0.005    -0.023    -0.014    0.004
N010201      0.860   0.813   0.876   0.837   -0.047    0.016     0.024     -0.039
N010301      0.838   0.827   0.814   0.803   -0.011    -0.024    -0.024    -0.011
N010401      0.700   0.646   0.701   0.653   -0.054    0.001     0.007     -0.048
N010402      0.361   0.376   0.370   0.371   0.015     0.009     -0.005    0.001
N010403      0.250   0.222   0.280   0.259   -0.028    0.030     0.037     -0.021
N013301      0.811   0.853   0.844   0.868   0.042     0.033     0.015     0.024
N014201      0.667   0.610   0.681   0.657   -0.057    0.014     0.047     -0.024

Average      0.600   0.593   0.622   0.621   -0.008    0.021     0.029     -0.000
Variance     0.031   0.029   0.030   0.028   0.00095   0.00055   0.00047   0.00150
Std.Dev.     0.177   0.170   0.174   0.167   0.03084   0.02336   0.02168   0.03868






Table 8.3

NAEP Bridge Study Weighted Percents-correct, Age 13

Item         1984    1986    1984b   1986b   86-84     84b-84    86b-86    86b-84b

N001501      0.934   0.958   0.975   0.940   0.024     0.041     -0.018    -0.035
N001502      0.785   0.785   0.791   0.805   0.000     0.006     0.020     0.014
N001503      0.841   0.887   0.900   0.885   0.046     0.059     -0.002    -0.015
N001504      0.805   0.819   0.847   0.821   0.014     0.042     0.002     -0.026
N002001      0.621   0.668   0.647   0.682   0.047     0.026     0.014     0.035
N002002      0.668   0.671   0.644   0.722   0.003     -0.024    0.051     0.078
N002003      0.733   0.759   0.717   0.784   0.026     -0.016    0.025     0.067
N002801      0.854   0.892   0.881   0.913   0.038     0.027     0.021     0.032
[illegible]  0.895   0.911   0.908   0.922   0.016     0.013     0.011     0.014
N003001      0.301   0.251   0.338   0.269   -0.050    0.037     0.018     -0.069
N003003      0.095   0.098   0.060   0.100   0.003     -0.035    0.002     0.040
N003101      0.845   0.834   0.841   0.844   -0.011    -0.004    0.010     0.003
N003102      0.746   0.782   0.744   0.786   0.036     -0.002    0.004     0.042
N004601      0.580   0.592   0.571   0.613   0.012     -0.009    0.021     0.042
N004602      0.687   0.659   0.644   0.704   -0.028    -0.043    0.045     0.060
N004603      0.818   0.764   0.785   0.838   -0.054    -0.033    0.074     0.053
N005001      0.221   0.240   0.238   0.243   0.019     0.017     0.003     0.005
N005002      0.352   0.446   0.367   0.439   0.094     0.015     -0.007    0.072
N005003      0.215   0.212   0.222   0.214   -0.003    0.007     0.002     -0.008

Average      0.631   0.644   0.638   0.659   0.012     0.007     0.016     0.021
Variance     0.065   0.067   0.068   0.069   0.00114   0.00078   0.00045   0.00149
Std.Dev.     0.255   0.258   0.261   0.262   0.03374   0.02786   0.02122   0.03861






Table 8.4

NAEP Bridge Study Weighted Percents-correct, Age 17

Item         1984    1986    1984b   1986b   86-84     84b-84    86b-86    86b-84b

N001501      0.964   0.941   0.974   0.951   -0.023    0.010     0.010     -0.023
N001502      0.888   0.846   0.916   0.841   -0.042    0.028     -0.005    -0.075
N001503      0.919   0.893   0.939   0.865   -0.026    0.020     -0.028    -0.074
N001504      0.900   0.869   0.911   0.830   -0.031    0.011     -0.039    -0.081
N002001      0.777   0.721   0.775   0.729   -0.056    -0.002    0.008     -0.046
N002002      0.811   0.755   0.808   0.799   -0.056    -0.003    0.044     -0.009
N002003      0.858   0.831   0.861   0.823   -0.027    0.003     -0.008    -0.038
N002801      0.947   0.927   0.936   0.923   -0.020    -0.011    -0.004    -0.013
N002802      0.962   0.938   0.949   0.932   -0.024    -0.013    -0.006    -0.017
N003001      0.468   0.387   0.462   0.397   -0.081    -0.006    0.010     -0.065
N003003      0.223   0.302   0.202   0.270   0.079     -0.021    -0.032    0.068
N003101      0.936   0.872   0.917   0.896   -0.064    -0.019    0.024     -0.021
N003102      0.885   0.856   0.886   0.865   -0.029    0.001     0.009     -0.021
N004601      0.704   0.681   0.689   0.699   -0.023    -0.015    0.018     0.010
N004602      0.795   0.791   0.826   0.814   -0.004    0.031     0.023     -0.012
N004603      0.877   0.856   0.890   0.868   -0.021    0.013     0.012     -0.022
N003201      0.912   0.881   0.935   0.874   -0.031    0.023     -0.007    -0.061
N003202      0.851   0.810   0.829   0.823   -0.041    -0.022    0.013     -0.006
N003203      0.736   0.701   0.778   0.693   -0.035    0.042     -0.008    -0.085
N003204      0.836   0.795   0.869   0.807   -0.041    0.033     0.012     -0.062
N005001      0.393   0.417   0.452   0.401   0.024     0.059     -0.016    -0.051
N005002      0.508   0.547   0.519   0.546   0.039     0.011     -0.001    0.027
N005003      0.314   0.288   0.284   0.327   -0.026    -0.030    0.039     0.043

Average      0.759   0.735   0.766   0.738   -0.024    0.006     0.003     -0.028
Variance     0.046   0.040   0.048   0.040
Std.Dev.     0.215   0.200   0.218   0.200






$\alpha_{if}$ is an effect for Item i specific to Form f. If, for a fixed f, the $\alpha$'s
have a mean of zero, there is no main effect such as the ones suspected
in the anomaly. A nonzero mean is an (undesired) form effect.

$\beta_{it}$ is an effect for Item i specific to the Time t population. Nonzero
means for different values of t are the mean differences over time that
the assessment is intended to measure.

$\epsilon_{ift}$ is an error term specific to Item i on Form f for Time t. We assume
that these terms have means of zero, and are independent over items,
time-points, and forms. (This independence is the only assumption in
this model.)

Associated with each term is a variance component. The components
relevant to present purposes are those for $\alpha$, $\beta$, and $\epsilon$:

$\sigma_f^2$ is item-by-form variance. Insofar as measuring change is concerned,
this is noise. Its deleterious effects are not reduced by increasing
the number of students in the sample, and, as we shall see, it can come
to dominate the variance of estimates of change. These effects can be
reduced to zero by maintaining identical forms and administration
conditions.

$\sigma_t^2$ is item-by-time variance. It arises because the items in a content area
become easier or harder by different amounts. Its impact on the
uncertainty of change can be reduced in two ways: by increasing the
number of items in a subject area, and by reporting in subject areas in
which items are likely to exhibit similar changes over time.

$\sigma_{ft}^2$ is sampling variance within Form f and Time-point t. It arises from the
fact that only a sample of the population is surveyed, and can be
reduced by increasing the student sample size. In this report we shall
denote the sampling variance in the four samples involved in the study
as $\sigma_{84}^2$, $\sigma_{86}^2$, $\sigma_{84b}^2$, and $\sigma_{86b}^2$.

We suppose that item-by-time and item-by-form variances are similar in
magnitude at all time points and over forms, respectively, but allow sampling 
variances to differ in different assessments because NAEP examinee sample 
sizes often vary considerably. 

An estimate of change from time point A to time point B, if the same
form F is used at both time points, is the average over items of n item-p
differences, where n is the number of items. Its expectation (over items i)
is

$$E_i(\beta_{iB} - \beta_{iA}) ,$$

an unbiased estimate of the true average change if the items have been
selected at random, and its variance is

$$[\, 2\sigma_t^2 + \sigma_{FA}^2 + \sigma_{FB}^2 \,] / n .$$

If change from Time A to Time B is estimated using two different forms, F and
G, the expected difference over items is

$$E_i(\beta_{iB} - \beta_{iA}) + E_i(\alpha_{iG} - \alpha_{iF}) ,$$

which confounds change over time with difference in form. It is an unbiased
estimate of the true average change only if the average item-by-form effect is
zero. Its variance is

$$[\, 2\sigma_t^2 + 2\sigma_f^2 + \sigma_{FA}^2 + \sigma_{GB}^2 \,] / n .$$

Note that even if item-by-form interactions are zero on the average, the
sensitivity of differences in average percent-correct as a measure of change
is degraded by the additional term $2\sigma_f^2$.
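The two variance expressions above can be sketched as a single function. This is only an illustration of the formulas as reconstructed here; the function name, argument names, and the numeric inputs are hypothetical and are not taken from the report.

```python
def change_variance(n_items, var_time, var_samp_a, var_samp_b, var_form=0.0):
    """Modelled variance of the average item-p difference over n_items items.

    var_time   : item-by-time variance (sigma_t^2)
    var_samp_a : sampling variance at the first time point
    var_samp_b : sampling variance at the second time point
    var_form   : item-by-form variance (sigma_f^2); leave at 0.0 when the
                 same form is used at both time points
    """
    total = 2.0 * var_time + 2.0 * var_form + var_samp_a + var_samp_b
    return total / n_items


# Made-up component values, chosen only to show the effect of the extra term.
same_form = change_variance(10, var_time=1.0, var_samp_a=2.0, var_samp_b=3.0)
diff_form = change_variance(10, var_time=1.0, var_samp_a=2.0, var_samp_b=3.0,
                            var_form=4.0)
print(same_form, diff_form)
```

With different forms the variance grows by 2 * var_form / n_items regardless of how many students are sampled, which is the point made in the text.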

Although our focus is on variance components, we should mention the
relevance of the preceding paragraph to the anomaly. The items in this study
are only a little more than half of those that appeared in common between 1984
and 1986, but the anomaly is reflected clearly in age 17 data (Table 8.4),
where the 1986 mean item-p lies .024 below the 1984 mean. This difference may
seem small from the perspective of measuring individuals, but it is larger
than the change in means between any two previous assessments over time spans
two to three times as long. If one ignores item-by-form and item-by-time
variance when gauging the statistical significance of this difference (as was
traditionally done in NAEP), the resulting t-statistic for change is about -8;
comparable values for the longer time spans in the past rarely exceeded 2 in
absolute value.² That the mean of the 1986 bridge sample lies .028 below the
mean of the 1984 bridge sample--two randomly equivalent samples from the 1988
population--suggests, however, that the 1984 to 1986 drop could be due in part
to a difference in test forms.



² A more appropriate error term, also based on Table 8.4, is the standard
deviation of the item-by-item differences between 1986 and 1984, divided by the
square root of the number of items, or .033/√23 = .007; this gives a t-statistic
of -.024/.007 = -3.5.
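The arithmetic in this footnote can be checked with a few lines. This is a minimal sketch using only the three figures quoted in the text (.033, 23 items, -.024); the variable names are illustrative.

```python
import math

sd_diff = 0.033      # SD of item-by-item (1986 - 1984) differences, age 17
n_items = 23         # number of common items at age 17
mean_diff = -0.024   # 1986 mean item-p minus 1984 mean item-p

# Standard error of the mean difference, treating items as the sampled units.
std_error = sd_diff / math.sqrt(n_items)
t_stat = mean_diff / std_error

print(round(std_error, 3), round(t_stat, 1))
```

Carrying the unrounded standard error through the division is what gives a value near -3.5 rather than -.024/.007 taken literally.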




8.5 Estimating Variance Components 

The variance components introduced above can be estimated from the data
in Tables 8.1 through 8.4 in the following way:

Step 1. Approximate $\sigma_{84}^2$, $\sigma_{86}^2$, $\sigma_{84b}^2$, and $\sigma_{86b}^2$ using item-p's, sample sizes, and
design effects. The sampling variance for a particular $p_{ift}$ is approximately

$$\frac{p_{ift}(1 - p_{ift})}{N_{ift}/\mathit{deff}}$$

where $N_{ift}$ is the sample size upon which $p_{ift}$ is based and deff is a design
effect that acts to increase the variance estimate due to the complex sample
design. Since $N_{ift}$ values varied little across items for fixed f and t, the
average value was used in this report for all $N_{ift}$'s in a given sample. Based
on the studies summarized by E. G. Johnson (1987a), a design effect of 1.5 was
employed for all items and all samples. In each sample, the average sampling
variance over items was used as if it applied to all items.³ The resulting
values are as follows:



Age    $\sigma_{84}^2$    $\sigma_{86}^2$    $\sigma_{84b}^2$    $\sigma_{86b}^2$

9      16    15    51    27
13     11    13    37    18
17     9     11    33    27

Note: all entries multiplied by 10^5; e.g., 9 means .00009.
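Step 1 can be sketched as follows. This is a minimal illustration of the approximation described above (binomial variance inflated by a design effect, then averaged over items); the function name and the example inputs are hypothetical, and only the design effect of 1.5 comes from the text.

```python
def average_sampling_variance(item_ps, n_students, deff=1.5):
    """Average over items of p(1 - p) / (N / deff).

    item_ps    : item percents-correct (as proportions) for one sample
    n_students : average number of students responding per item
    deff       : design effect for the complex sample design
    """
    effective_n = n_students / deff
    variances = [p * (1.0 - p) / effective_n for p in item_ps]
    return sum(variances) / len(variances)


# Hypothetical inputs: two items at p = .5 answered by 1500 students each.
example = average_sampling_variance([0.5, 0.5], n_students=1500)
print(example)  # 0.25 / 1000
```

Applying the function to a column of item-p's from Tables 8.2 through 8.4 with the matching sample size from Table 8.1 should give values of the same order as the entries tabulated above, though exact agreement is not guaranteed because the published item-p's are rounded.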



³ While it might be preferable to transform item-p's to arcsins to better
justify the use of a single sampling variance value, the average of varying
sampling variance values for untransformed item-p's was employed for ease of
computation and, we hope, comprehensibility.




Step 2. Compute the variances among differences between the item-p's for
given items in selected pairs of samples. The expectations of the variances
of these differences can be expressed as functions of the variance components
of interest:

$$\mathrm{Var}(p_{86b} - p_{84b}) = 2\sigma_f^2 + \sigma_{86b}^2 + \sigma_{84b}^2 \qquad (1)$$

$$\mathrm{Var}(p_{86b} - p_{86}) = 2\sigma_t^2 + \sigma_{86b}^2 + \sigma_{86}^2 \qquad (2)$$

$$\mathrm{Var}(p_{84b} - p_{84}) = 2\sigma_t^2 + \sigma_{84b}^2 + \sigma_{84}^2 \qquad (3)$$

$$\mathrm{Var}(p_{86} - p_{84}) = 2\sigma_f^2 + 2\sigma_t^2 + \sigma_{86}^2 + \sigma_{84}^2 \qquad (4)$$
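As an illustration of Step 2, the sketch below recomputes the variance of the (1986b - 1984b) differences at age 9 from the 86b-84b column of Table 8.2. The use of the divide-by-n (population) variance is an assumption; it is chosen here because it appears to reproduce the printed figures for this column.

```python
from statistics import pvariance, pstdev

# 86b-84b column of Table 8.2 (age 9), one entry per item.
diffs_86b_84b = [
    -0.008, 0.011, -0.013, 0.006, 0.063, 0.006, -0.020, -0.031, 0.009,
    -0.013, 0.099, -0.072, -0.022, -0.014, -0.036, 0.059, 0.083, 0.001,
    0.004, -0.039, -0.011, -0.048, 0.001, -0.021, 0.024, -0.024,
]

# Assumption: variance computed with divisor n rather than n - 1.
var_diff = pvariance(diffs_86b_84b)
sd_diff = pstdev(diffs_86b_84b)

print(round(var_diff, 5), round(sd_diff, 5))
```

The result is close to the .00150 and .03868 shown in the last column of Table 8.2; small discrepancies are expected because the tabulated differences are rounded to three decimals.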



Step 3. Substituting estimated error variances from Step 1 and observed
variances among item-p differences into the formulas in Step 2, solve for $\sigma_t^2$
and $\sigma_f^2$. This is done separately for each age. Equations 2 and 3 both yield
approximations of $\sigma_t^2$; the average is also reported. Equation 1 yields an
approximation of $\sigma_f^2$, which is reported. Substituting the estimate of $\sigma_t^2$ into
Equation 4 yields a second approximation of $\sigma_f^2$, which is also reported.
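The three steps can be sketched as one small solver. This is an illustrative reading of Equations 1 through 4, not the report's own program; the function and argument names are hypothetical. The example uses the age 13 figures, all expressed in units of 10^-5: the difference variances from the last rows of Table 8.3 and the sampling variances from Step 1.

```python
def estimate_components(var_86b_84b, var_86b_86, var_84b_84, var_86_84,
                        s84, s86, s84b, s86b):
    """Solve Equations 1-4 for item-by-time and item-by-form variance."""
    # Equations 2 and 3: same form, different time points.
    t_eq2 = (var_86b_86 - s86b - s86) / 2.0
    t_eq3 = (var_84b_84 - s84b - s84) / 2.0
    t_ave = (t_eq2 + t_eq3) / 2.0
    # Equation 1: same time point, different forms.
    f_eq1 = (var_86b_84b - s86b - s84b) / 2.0
    # Equation 4, after substituting the averaged item-by-time estimate.
    f_eq4 = (var_86_84 - s86 - s84) / 2.0 - t_ave
    f_ave = (f_eq1 + f_eq4) / 2.0
    return {"t_eq2": t_eq2, "t_eq3": t_eq3, "t_ave": t_ave,
            "f_eq1": f_eq1, "f_eq4": f_eq4, "f_ave": f_ave}


# Age 13 inputs, in units of 10^-5.
age13 = estimate_components(var_86b_84b=149, var_86b_86=45, var_84b_84=78,
                            var_86_84=114, s84=11, s86=13, s84b=37, s86b=18)
print(age13)
```

Because no constraint is imposed, an estimate can come out negative when the observed difference variance is smaller than the sum of the two sampling variances.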






8.6 Results 

The estimates of variance components are shown below.

           $\sigma_t^2$ (item-by-time)         $\sigma_f^2$ (item-by-form)
Age        Eq 2    Eq 3    Ave           Eq 1    Eq 2-4    Ave

9          3       -6*     -1            36      34        35
13         7       15      11            47      34        41
17         2       4       3             47      42        44

Average                    4                               40

Note: all entries multiplied by 10^5; e.g., 3 means .00003.

*The estimated value of -6 x 10^-5 has been carried through for the purpose of
averaging, although variances must, of course, be nonnegative.

Note that item-by-form variances, which are avoidable, dwarf item-by-time
variances, which are not, roughly by a factor of ten. Also, recall that
the variance of change in average item percents-correct is a sum of components
for sampling variance, item-by-time variance, and, if different forms are
used, item-by-form variance.

Using the sampling variance figures from the 1984 assessment, we can
compare the total variance that might be expected when comparing
percents-correct from 1984 to 1986 had the same form been used with the same sample
size at both occasions, with the total variance that might be expected with
the different forms that were actually employed. The total variance using
different forms, each comprising the same n items, is modelled as

$$2(\sigma_{84}^2 + \sigma_t^2 + \sigma_f^2)/n ; \qquad (5)$$

The total variance using the same form, comprising, say, n* items, is

$$2(\sigma_{84}^2 + \sigma_t^2)/n^* . \qquad (6)$$



Using averages over ages of variance component estimates, we obtain
(10 + 4 + 40)/n and (10 + 4)/n* respectively for (5) and (6). These values
are equal when n/n* = (10 + 4 + 40)/(10 + 4) = 3.86. Thus, even in the absence
of main effects for test forms (i.e., no "anomalies"), it takes about four
times as many items to get the same accuracy for measuring change using
methodologies that differ as little as the 1984 and 1986 reading surveys,
compared to using the same methodology at both occasions.

From another perspective, we can ask how many fewer respondents would be
needed to achieve the same precision when forms are kept the same, compared to
when they are different. To answer this question, we again work with the 1984
BIB error variance, and denote the student sample size by N. The modelled
variance when different forms are used is as follows:

$$2(\sigma_{84}^2 + \sigma_t^2 + \sigma_f^2)/n . \qquad (7)$$

A comparison based on the same forms with an examinee sample size of N* and
the same design effect would have the following modelled variance:

$$2[(N/N^*)\sigma_{84}^2 + \sigma_t^2]/n . \qquad (8)$$

With the average values 10, 4, and 40 for $\sigma_{84}^2$, $\sigma_t^2$, and $\sigma_f^2$ respectively, (7) is
equal to (8) when N/N* = 5. This can be interpreted as saying the respondent
sample must be five times as large to achieve the same precision for measuring
change with different forms, compared to what is required when using the same
forms at both occasions.
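The two trade-offs above follow from setting (5) equal to (6) and (7) equal to (8). A minimal sketch, with hypothetical function names and the rounded average components 10, 4, and 40 (in units of 10^-5) quoted in the text:

```python
def item_ratio(samp_var, time_var, form_var):
    """n / n*: items needed with different forms relative to the same form.

    Obtained by equating (5) and (6).
    """
    return (samp_var + time_var + form_var) / (samp_var + time_var)


def respondent_ratio(samp_var, form_var):
    """N / N*: respondents needed with different forms relative to the same form.

    Obtained by equating (7) and (8); the item-by-time term cancels.
    """
    return (samp_var + form_var) / samp_var


n_ratio = item_ratio(10, 4, 40)
resp_ratio = respondent_ratio(10, 40)
print(round(n_ratio, 2), resp_ratio)
```

The item-by-time variance drops out of the respondent ratio because adding students does nothing to reduce it, so only the sampling and item-by-form components matter there.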

8.7 A Quick Comparison with Paced Presentations 

Prior to 1984, NAEP reading assessments were conducted with tape 
recordings that paced students through their survey forms with controlled 
allocations of time for each item. The order of items and the length of the 
surveys were allowed to vary from one assessment year to the next. Time
allocation under the present BIB-spiraling conditions is controlled only at
the level of blocks of items approximately 15 minutes in length. In order to
get a feel for the combined extent of item-by-time and item-by-form
interactions in paced-administration data, item-p's were examined for 20 items
at each age that appeared in NAEP in the 1975 and 1980 assessments. 

As in Equation 4, the variance among item-p's across two assessment
years with different paced forms confounds item-by-form and item-by-time
interactions. There is a five-year difference between the 1975 and 1980 NAEP
assessments, so we compared these item-p difference variances with the
(1986b-1984) differences from the bridge study discussed above, which had a four-year
time span; this ensures that the item-by-time components of the BIB and the
paced total variances will be similar. The total variances in item-p's at
ages 9, 13, and 17 were .00130, .00072, and .00092 for the (1986b-1984) BIB
data, and .00067, .00094, and .00247 for the (1980-1975) paced data.






The comparable magnitudes of the BIB and paced total variances suggest
that controlling certain key aspects of the local environment of items
(e.g., time allotted for a given item) in the paced format, but not others
(e.g., location in assessment booklet, preceding exercises) did not produce
significantly lower item-by-form variances. That is, the item-by-form
variance noted above in BIB is undesirable and largely avoidable, but it does
not represent a great increase over variances of the same kind that appeared 
to have existed under paced administration in earlier NAEP reading surveys. 
It may be the case, however, that controlling item-level timing and 
administration conditions succeeded to a larger degree in avoiding form main 
effects, so that anomalies like the one seen in 1986 did not arise.*

* Because different factors can contribute to form main effects and
item-by-form interaction, even if the two could have been measured separately, finding
similar magnitudes of item-by-form interaction in BIB and paced assessments would
not address the question of whether anomalies qua form main effects occurred in
the past.

8.8 Assessment Versus Individual Measurement

In view of the major impact of item-by-form variation upon the
sensitivity of an assessment instrument, one wonders why such effects were not
anticipated and avoided, or at least incorporated into standard errors for
estimates of change, since NAEP's inception. One reason may be that effects
of exactly the same size often truly are negligible in the setting of
individual measurement, the arena in which the "common wisdom" about
educational measurement has for the most part accumulated.

Consider measuring an individual student when alternative test forms
exhibit the magnitude of item-by-form variance detected between the 1984 and
1986 NAEP forms, about .00040 at the level of the individual item. As in the
assessment setting, the measure is imperfect for two reasons: item-by-form
variation and sampling variation. In individual measurement, sampling
variance at the item level is driven by observing but a single binary
response; on an item with an item-p of .7, this value is .7 x .3 = .21 for a
typical student. Adding item-by-form variance of .0004 increases total
variance beyond sampling variance by less than two-tenths of one percent. In
contrast, the item-level sampling variance component of the 1984 and 1986 NAEP
assessments was driven by securing responses from some 2,000 students,
producing a value of about .00010. Adding to this the item-by-form variance
of .00040 increased total variance beyond sampling variance by four hundred
percent. In this example, a researcher interested in individual measurement
could safely ignore an item-by-form variance that would devastate precision in
an assessment. (Sheehan & Mislevy, 1988, demonstrate similar effects in the
setting of item response theory.)
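The contrast drawn in this paragraph is a ratio of the same item-by-form variance to two very different sampling variances. A minimal sketch using only the figures quoted above; the function name is illustrative.

```python
def relative_increase(sampling_var, form_var):
    """Increase in total variance beyond sampling variance, as a proportion."""
    return form_var / sampling_var


form_var = 0.0004

# Individual measurement: one binary response to an item with item-p of .7.
individual = relative_increase(0.7 * 0.3, form_var)

# Assessment: item-level sampling variance of about .00010 (some 2,000 students).
assessment = relative_increase(0.0001, form_var)

print(f"{individual:.4%}", f"{assessment:.0%}")
```

The first ratio is under two-tenths of one percent and the second is about four hundred percent, matching the figures in the text.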

8.9 Conclusion 

Item-by-form interactions were detected in analyses of percents-correct
of items that appeared in the 1984 and 1986 NAEP reading assessment
instruments and samples of students administered 1984 and 1986 test forms in
1988. A quick look at historical results from previous paced NAEP reading
surveys suggests that the 1984/1986 BIB item-by-form interactions, while
undesirable and largely avoidable, are about the same size as corresponding
effects in past NAEP assessments under paced administration.

This variation degrades the sensitivity of trend analyses. The
item-by-form interactions observed in NAEP data would be negligible for comparing
individual examinees or tracking an individual's performances over time, but
they are large from the perspective of estimating population changes. The
magnitudes of the item-by-form variances detected in the 1984 and 1986 NAEP
assessments had effects comparable to cutting the number of items to
one-quarter or the examinee sample size to one-fifth.

Item-by-form interactions merely reduce efficiency (albeit possibly 
dramatically) as long as their average effects are zero. Nonzero averages, on 
the other hand, can invalidate the data totally for comparing performance
levels over time. It may be that controlling item-level timing and
administration conditions in past NAEP assessments helped to minimize the form
main effects that can cause anomalies such as the one observed in 1986. The
corresponding step that is now being taken under BIB procedures is to hold
some proportion of timed blocks identical across successive assessments, with 
respect to composition, timing, and administration conditions. 







Chapter 9
EPILOGUE 



Albert E. Beaton 



The study of the 1988 bridges shows that the effect of changing 
measurement instruments can be so large that it obscures real changes in 
educational performance. This leads us to repeat the major lesson from the 
reading anomaly that was stated in Chapter 1: When measuring change, do not
change the measure. The empirical evidence to support the wisdom of this
lesson is clear enough from the results of the analyses of the measurement
system changes incorporated in the 1988 bridge samples, which are summarized
in Chapter 4. The work by Mislevy, shown in Chapter 8, presents further
evidence by computing item-by-form interactions and showing that the amount of
variance created by changing assessment forms may be substantially greater
than the variance over time of student performance, which we are attempting to
measure. As Mislevy shows, this variance was present even when assessment
items were individually timed using a tape recorder. The lesson is clear:
Changes in trend assessment methodology are fraught with danger and should be 
undertaken only with great care. 

The pressure to make changes in assessment instruments and procedures is
considerable. NAEP's complex consensus process involves hundreds of staff and
advisors, many of whom have suggestions about how NAEP can be improved. Most,
if not all, of these suggestions have merit. For example, committees of
teachers reviewing items may make suggestions as to how to make individual
items more precise. A printer may suggest ways to improve the artistry of a
booklet. And yet, for measuring trends, these suggestions must be rejected, 
since they might render the various assessment years incomparable. 

Defending the previous assessment procedures is not always easy. NAEP 
has been measuring trends since 1969, and there have been substantial changes 
in curriculum since then. For example, some formerly emphasized topics from 
the "new math" are no longer taught, and the pencil-and-paper computation of a
square root has been de-emphasized. Over the years, NAEP has carefully
removed items on such topics. Today, many believe that students should have
different proficiencies, such as knowing how to use a scientific calculator.
NAEP has already introduced calculator items and will use scientific
calculators with the 1990 cross-sectional sample. Never changing the
measurement instruments would surely make NAEP grow obsolete and 
uninteresting. 

The tension between continuity and change is not unique to NAEP or to 
educational measurement. For example, as United States corporations merge, go 
private, or fail, the Dow Jones average must change its composition while 
maintaining as much continuity in interpretation as possible. Government 
indices such as the Consumer Price Index must also adjust to changes in 
popular consumption. Such changes can never be made without introducing some 
change in the properties of the indicator, yet the changes are necessary to 
keep the indicator relevant. 

We believe that the proposed adjustments to the NAEP design are a 
prudent response to the conflicting goals of measuring trends and using
up-to-date and relevant measurement. For now, we will maintain separate trend
samples in which the measurement instruments and procedures are as close as
possible to those used in the assessment with which the new data will be
compared. Separate samples of students will be measured using the most
current information about the subject area and innovative technology. Only
after their properties and their relationship to the trend lines are fully
understood will assessment forms and technology move from these innovative 
samples to the trend samples. 

Investigation of the reading anomaly has reinforced the realization that 
no measurement is perfect, especially the measurement of changes over time.
Even when the best available measurement technology is applied, subtle changes in
the relevance of items and small shifts in the school populations both
introduce interpretive difficulties into comparisons with the past. Even
holding the measurement system constant does not assure that changes in
instruction and the form of learning will not affect the meaning of trends.
Sampling error and other, inestimable types of error also affect the accuracy
of trend estimates. The public as well as the measurement community should
understand the difficulties and limitations of measurement, in education as in
economics, in science, or in technology. 

Despite what has been presented about the limitations of assessment, it 
is important to note that a national assessment is still useful, indeed 
indispensable, if we expect to make decisions about the path that American 
education should take in the future. Educational policy makers and the public 
want to know if there have been major shifts in educational performance, and 
NAEP is the best instrument we have found for measuring such changes. This
study of the reading anomaly shows that it is inappropriate to overinterpret
small shifts in performance that occur in a short period of time; such small
shifts might be attributable to the various errors (only some of whose sizes
we can estimate directly) that affect an estimate. In interpreting small
changes, it is usually prudent to repeat the measurement procedure over time
until the shift stabilizes as a trend or is corroborated through other
sources. Although the standard error attributable to measurement may be large
compared to the changes in average performance that have been observed over
the years, the standard error of a proficiency mean is quite small compared to the
total variability of student performance. Put another way, the standard error
is small compared to the difference between adjacent anchor points on the NAEP
scales, and these anchor points represent substantial differences in student
performance.^ We have little doubt that, even in the short term, NAEP would
reliably identify major shifts in educational performance, as it is intended 
to do. In a longer term, as it is also intended to do, it will reliably 
identify the cumulative effect of more or less consistent trends that are 
small in the short term. 

Finally, although we intend to minimize changes in the assessment
technology used for trend estimation, we also feel strongly that experimenting 
with and eventually introducing newer technology is essential for NAEP. The 
history of science is brimming with improvements in measurement that have 
resulted in better understanding of the world around us. Study of the reading 
anomaly has given us a fuller understanding of the strengths and weaknesses of 
the present NAEP assessment technology. The identification of technological 
limitations always presents a challenge for methodological improvement. 



^Anchor points are used to describe what students at various levels of the
NAEP scales know and can do. They are described in the reports in which the 
various scales are discussed and in the NAEP technical reports (Beaton, 1987, 
1988b). Basically, the description of an anchor point describes what a large 
majority of students at that level know and can do that a majority of students 
at lower levels cannot. 






APPENDIX A



Summary of Modifications in Reading Scale Results 
Used in Tables and Figures 






Figure 1.1 
Figure 1.2 






Expanded Conditioning 
Model for 1971- ICeS? 

Yes 
Yes 



Adjusted 1984 



Weights? 

No 
Yes 



[b]



Modified 1986 
Conditioning Model? 

No 
Yes 



[b]



1986 Results Adjusted 
for Context Effect? 

No 
Yes 



Which Set of 1988 
Results Used? ^^^ 

Not applicable 
Set 2 



Table 4.1 [All changes and adjustments are detailed in this table] 

Figure 4.1 Yes Yes Yes No 



Set 2 
Set 1 



Table 5.9 Yes Yes 

Figure 5.1 [Identical to Figure 4.1] 



Yes 



No 



Set 1 



Figure 6.1 
Table 6.1 

Tables 6.2-6.4 



[Identical to Figure 1.2] 
Yes Yes 



Yes 



Yes 



Yes 
Yes 



Unadjusted and 
adjusted results given 

Unadjusted and 
adjusted results given 



Not applicable 



Set 2 



See Chapter 4 for further detail. 
Applies to ages 9 and 13 only. 

Two sets of results were obtained for the 1988 bridge to 1984: (1) a set that uses the same 
conditioning variables as the 1988 bridge to 1986, maximizing the comparability of the results for the two 
bridges (see Chapters 4 and 5) and (2) a set that uses an expanded conditioning model that maximizes 
comparability with the 1984 results. Set 2 is most appropriate for assessing trend and is used in the most 
recent reading trend report, The Reading Report Card, 1971 to 1988 (Mullis & Jenkins, 1990). In the present 
report, figures and tables that compare the 1988 bridge to 1984 to the 1988 bridge to 1986 use Set 1; 
figures and tables that do not include the 1988 bridge to 1986 use Set 2.






APPENDIX B 

Sampling and Weighting Procedures for the 1988 NAEP Bridges 






Eugene G. Johnson
Keith F. Rust 



Each of the bridge samples drawn as a part of the 1988 assessment was 
designed to replicate the administration of an earlier NAEP assessment. Thus
the sampling and weighting procedures used in these bridges were designed to
repeat as closely as feasible the procedures used previously. Some changes
from the previous procedures were necessary, however. In particular, the
poststratification procedures^ used in 1988 differed somewhat from those used
in 1986 and 1984; these changes are described below. The effects of these
changes in procedures on proficiency scores are also given below and are shown
to be relatively small.

THE 1988 BRIDGE SAMPLES 

The bridge studies included in the 1988 assessment that pertain to the
current report are as follows: 

Bridge to 1984: This bridge consists of samples comparable to the 1984
main assessment and addresses the subject areas of reading and writing. The
samples are collected by grade and by age for age 9/grade 4, age 13/grade 8,
and age 17/grade 11, using the age definitions and time of testing equivalent
to those used in 1984. Six assessment booklets were administered at each
age/grade, each of these booklets consisting of at least one block of reading
items and at least one block of writing items. The administration of these
booklets was nonpaced (that is, no audiotape was used). Thus at all three
ages a spiral, print-administered bridge of reading and writing was conducted.
The booklets used formed part of the spiral assessment in 1984, when reading
and writing were both administered. For the 1984 sample these assessments
were weighted as part of the full spiral sample, using 39 poststratification
cells for each age (although only 26 of these are relevant to the age
eligibles, the group of interest across time).

^In poststratification, the sampling weights are adjusted to make sample
estimates of certain subpopulation totals conform to external, more accurate
estimates.

The bridge samples for 1988 consist of approximately 4,000 age-eligible
and approximately 5,200 age/grade-eligible students at each age class. The
original 1984 spiral samples consisted of 26,000 to 29,000 age/grade-eligible
students. The level of poststratification used in 1984 appears to be about
the full extent possible without giving rise to reduced gains in estimation
efficiency. Since the 1988 bridge samples are based on many fewer students
than the 1984 spiral samples, it did not seem appropriate to use the same 
poststrata for the 1988 bridge samples and so some collapsing of poststrata 
was performed. The comparability of weighting procedures of the original and 
the bridge samples will be discussed later in this appendix. 

Bridge to 1986, Ages 9 and 13: This bridge consists of samples for ages
9 and 13 comparable to those used for the measurement of trend in 1986. The
samples were collected by age only and used age definitions and time of
testing equivalent to those used in 1984 and in the 1986 bridge to 1984. The
subject areas addressed by this bridge are reading, mathematics, and science.
Three assessment booklets were administered at each of the ages 9 and 13, and
these are the same booklets as were administered in 1986. Each booklet
contains one block of reading, one block of mathematics, and one block of
science exercises. As in 1986, the mathematics and science blocks were
administered using a tape recorder while the reading blocks were administered
by pencil and paper only. The three tape sessions at each age were conducted
to replicate the fall and winter bridges conducted in 1986. The numbers of
students for the two sets of samples are similar: around 2,000 age eligibles
each in 1986 and around 1,333 each in 1988. Although time restrictions
prevented the exact repetition of the poststratification procedures,
comparability has been maintained as much as possible (specifically, by not
using age and grade eligibility for nonresponse adjustment and
poststratification). Seven poststrata were used for each age in 1988
(compared to eight in 1986), with five of the poststrata having the same
definition across the two assessments.

Bridge to 1986, Age 17: This bridge consists of a sample of age
17/grade 11 students comparable to the 1986 main assessment using an
equivalent age definition and time of testing to that used in that assessment
and, since those definitions are the same, in the 1984 assessment. The
subject areas addressed by this bridge are reading, mathematics, science, and
history. Seven assessment booklets were administered to age 17/grade 11
students. One consisted entirely of blocks of history items; the remaining
six consisted of blocks of reading, mathematics and science items. The
administration of these booklets was nonpaced. The books of reading,
mathematics, and science were administered as part of the full spiral sessions
in 1986, where their purpose was to bridge to 1984. In the 1988 bridge they
were repeated in separate spiral sessions since the age definition is
different from the regular Age 17 assessment in 1988. As in the other spiral
bridges, it was not possible to repeat the full level of poststratification
that was used on the 1986 sample, where 26 poststratification cells were used
for age-eligible students, and 39 in total.

SAMPLE DESIGN 

The sample of students for the 1988 NAEP assessment was selected using a
complex multistage sample design involving the sampling of students from
selected schools within 94 selected geographic regions, called primary
sampling units (PSUs), from across the United States. All 94 PSUs were used
for the main 1988 assessment and subsamples of these PSUs were used for the
bridge assessments. The sample design, which is similar to that used in 1986,
will be described in detail by Westat, Inc., the firm subcontracted by ETS to
select the sample, in National Assessment of Educational Progress —1988 
Sampling and Weighting Procedures, Final Report. This section will provide an 
overview of the design. Since the PSUs used for the bridge assessments were 
subsamples of those used for the main assessment, the selection of the main 
assessment PSUs is given first. 

Primary Sampling Units for the Main Assessment

In the first stage of sampling, the United States (the 50 states and the
District of Columbia) was divided into geographic primary sampling units,
where each PSU met a minimum size requirement and comprised either a
metropolitan statistical area (MSA), a single county, or a group of contiguous
counties. Twelve subuniverses of PSUs were then defined as described below.

The 34 largest PSUs were designated as certainty units because they were
so large as to be selected with probability one. The remaining, smaller PSUs
were not guaranteed to be selected into the sample. These were grouped into a
number of noncertainty strata (so called because the PSUs in these strata were
not included in the sample with certainty).

The PSUs were classified into four regions, each containing about one- 
fourth of the U.S. population. In each region, PSUs were classified as MSA or 
nonMSA. In the Southeast and West regions, the PSUs were further classified 
as high minority (at least 20 percent of the population in the 1980 Census was
either Black or Hispanic) or not. The resulting subuniverses are shown below. 

Table B.1

The Sampling Subuniverses
and the Number of Noncertainty Strata in Each

                      MSA PSUs                    NonMSA PSUs
               Regular   High-minority      Regular   High-minority
Region         Strata    Strata             Strata    Strata

Northeast         8                            2
Southeast         4          6                 4          6
Central           8                            6
West              4          6                 4          2

Total            24         12                16          8



Within each major stratum (the subuniverses), further stratification was
achieved by ordering the noncertainty PSUs according to several additional
socioeconomic characteristics, yielding 60 strata. One PSU was selected with
probability proportional to size from each of the 60 noncertainty strata.
PSUs within the high-minority subuniverses were sampled at twice the rate of
PSUs in the other subuniverses.
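
The selection of one PSU per noncertainty stratum with probability
proportional to size can be illustrated with a minimal sketch. The PSU names
and measures of size below are invented for illustration, and the
cumulative-total method shown is only one common way to draw a single unit
with probability proportional to size; this appendix does not state which
algorithm was actually used. The doubled sampling rate for the high-minority
subuniverses is not modeled here.

```python
import random

def select_psu_pps(stratum, rng):
    """Pick one PSU from a stratum with probability proportional to size.

    `stratum` is a list of (psu_name, measure_of_size) pairs.
    Returns the chosen name and its selection probability.
    """
    total = sum(size for _, size in stratum)
    # Draw a point on [0, total) and walk the cumulative sizes.
    point = rng.random() * total
    cumulative = 0.0
    for name, size in stratum:
        cumulative += size
        if point < cumulative:
            return name, size / total
    # Guard against floating-point round-off at the upper edge.
    name, size = stratum[-1]
    return name, size / total

# Hypothetical stratum: names and sizes are assumptions, not NAEP data.
stratum = [("PSU-A", 50_000), ("PSU-B", 30_000), ("PSU-C", 20_000)]
rng = random.Random(1988)
name, prob = select_psu_pps(stratum, rng)
print(name, prob)
```

The returned selection probability is what a base weight would later be built
from, since the base weight is the reciprocal of the overall probability of
selection.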

These 94 PSUs were used for the main assessments of all three age
classes. To allow for the estimation of within-school-year growth in
achievement and to match the administration times of previous assessments, the 
assessment sample was divided into two randomly equivalent subsamples, one 
subsample to be assessed in the winter and the other to be assessed in the 
spring. For this purpose, the 94 PSUs were designated as winter PSUs, spring 
PSUs, or both winter and spring PSUs according to the following scheme. The 
18 largest certainty PSUs were designated as both winter and spring PSUs, to 
be included in the sample for both seasons (the sample of schools within each 
of these PSUs was randomly split in half, one subsample to be assessed in the 
winter and one to be assessed in the spring) . The 16 smaller certainty PSUs 
were ordered by region and then alternately designated as winter PSUs or 
spring PSUs, resulting in 8 PSUs for each season. Similarly, alternate 
members of the set of the 60 noncertainty PSUs, arranged in stratum order
within each subuniverse, were designated as winter or spring PSUs. The end
result was 56 winter PSUs, 38 in which assessments were conducted only in the
winter and 18 in which assessments were conducted in both winter and spring, 
and 56 spring PSUs, consisting of 38 in which only spring assessments were 
conducted plus the 18 winter and spring PSUs. 
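
The season counts in this paragraph can be checked with a short sketch. It
assumes that alternation begins with winter in each ordered list and that
alternating down the whole list of 60 noncertainty PSUs gives the same counts
as alternating within each subuniverse (which holds when every subuniverse
contains an even number of strata, as in Table B.1). The function name and
structure are illustrative only.

```python
def assign_seasons(n_both, n_small_certainty, n_noncertainty):
    """Count winter-only, spring-only, and both-season PSUs."""
    seasons = {"winter_only": 0, "spring_only": 0, "both": n_both}
    # Alternate winter/spring down each ordered list of PSUs.
    for n in (n_small_certainty, n_noncertainty):
        for i in range(n):
            key = "winter_only" if i % 2 == 0 else "spring_only"
            seasons[key] += 1
    return seasons

counts = assign_seasons(n_both=18, n_small_certainty=16, n_noncertainty=60)
winter = counts["winter_only"] + counts["both"]
spring = counts["spring_only"] + counts["both"]
print(counts, winter, spring)
```

Under these assumptions the sketch reproduces the totals stated in the text:
38 single-season PSUs plus the 18 two-season PSUs for each season.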

Primary Sampling Units for the Bridge Assessments 

The bridge assessments used a subsample of the 94 PSUs used for the main 
assessment. The age 9/grade 4 bridge assessments, which were conducted in the
winter, used the 56 PSUs designated as winter PSUs in the main assessment; the
age 17/grade 11 bridge assessments, conducted in the spring, used the 56 PSUs
designated as spring PSUs. The age 13/grade 8 bridge assessments, conducted
in the fall, used 64 PSUs that were selected from the complete set of 94 PSUs 
with probability proportional to a measure of size. As with the winter and 
spring subsamples, the 18 largest certainty PSUs were retained in the fall 
bridge sample with certainty. 

Schools for Bridge Samples; the Assignment of Sessions to Schools 

Schools to participate in the age 13/grade 8 bridge assessments
(conducted in the fall) were selected from the subsample of 64 PSUs that had
been designated as the age 13/grade 8 bridge PSUs. To avoid the possibility
that a particular bridge session might be assigned to a school with only one
or very few eligibles, small schools were clustered with other schools in the
same PSU to form clusters of a specified minimum number of eligibles. Bridge
sessions were then assigned within each PSU by selecting a school cluster with
probability proportional to the estimated number of age and grade eligibles
within the school (or school cluster).

Schools to participate in the age 9/grade 4 bridge assessments 
(conducted in the winter) were selected from the subsample of the PSUs 
designated as being for the winter assessment. The selection was such that
each of the distinct booklets used in the bridge assessments would be
administered at least once within each of the 56 PSUs designated as winter or
both winter and spring PSUs. Clusters of schools were formed in the same
manner as for age 13/grade 8; in this case, two clusters were selected per
PSU.

In a like manner, schools to participate in the age 17/grade 11 bridge 
assessments (conducted in the spring) were selected from the subsample of the 
PSUs designated as being for the spring assessment such that each of the
distinct booklets used in the bridge assessments would be administered once 
within each of the 56 spring PSUs. Two clusters of schools were selected per 
PSU. 

For all three age/grades, sessions were assigned to bridge sample 
schools in the following manner. First, the number of sessions per school was 
established. This was the maximum, up to four, that could be administered 
without creating unduly small session sizes with few eligibles. Thus in most
bridge sample schools four types of session were conducted, but, for example, 
schools with fewer than 20 eligibles were asked to conduct just a single 
session. The assignment of sessions to schools was performed so as to 
maximize the number of session types conducted within each PSU. Thus, to the 
extent feasible, session assignment was delayed until after it was determined 
that a selected school would participate in the sample. Because this happened 
sometimes but not always, two types of school nonresponse adjustment factor, 
denoted school and session, were required. 

This procedure assured that each session type was assigned in each PSU 
at least once for the age 9/grade 4 and age 17/grade 11 samples. At age 
13/grade 8, however, sometimes a PSU was represented in the sample by a single
large school. As it was not considered feasible to administer each of five
different session types in a single school, not all session types were
administered in all 64 PSUs, but each session type was administered in most
PSUs.







Sampling Students 

In the fourth stage of sampling, a consolidated list of all grade- 
eligible and age-eligible students was established for each school. A 
systematic selection of eligible students was made if necessary to provide the
target sample size (otherwise all eligible students were selected) and, for
bridge sample schools assigned both pencil and paper and paced-tape
assessments, students were randomly assigned by Westat district supervisors to
print or tape sessions using prespecified procedures. Students assigned to
paced-tape sessions who were not age-eligible were dropped from the
assessment. 
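
A minimal sketch of systematic selection from a consolidated school list
follows. The random start, the fractional sampling interval, and the roster of
student labels are assumptions made for illustration; the prespecified
procedures actually used in the field are not reproduced in this appendix.

```python
import random

def systematic_sample(students, target, rng):
    """Systematic selection with a random start and a fractional interval."""
    n = len(students)
    if n <= target:
        # Short list: every eligible student is selected.
        return list(students)
    interval = n / target              # sampling interval, greater than 1
    start = rng.random() * interval    # random start in [0, interval)
    picks = []
    for k in range(target):
        # Clamp to the last index as a guard against round-off.
        index = min(int(start + k * interval), n - 1)
        picks.append(students[index])
    return picks

# Hypothetical roster of 100 eligible students in one school.
roster = [f"student-{i:03d}" for i in range(100)]
sample = systematic_sample(roster, target=25, rng=random.Random(4))
print(len(sample), sample[:3])
```

Because the interval exceeds one whenever the list is longer than the target,
successive picks always fall on distinct students and preserve list order.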

Excluded Students 

Some students selected for the sample were deemed unassessable by the 
school authorities because they had limited English language proficiency, were 
judged as being educable mentally retarded, or were functionally disabled. In 
these cases, an Excluded Student Questionnaire was filled out by the school 
staff listing the reason for excluding the student and providing some 
background information. The same guidelines for exclusion were employed for 
all bridges as well as for the main assessment. For the excluded students, 
unlike the assessed students, no distinction was made as to the season of the 
year in which their school was assessed since the timing of the assessment is
unimportant for these unassessed students. Consequently, for age 9/grade 4 
and age 13/grade 8, no distinction is made between students excluded from the 
bridge assessments and the students excluded from the main assessments since 
the same grade and age eligibility definitions apply in each case. Since this 
is not the case for the third age class, the excluded students from the bridge
assessments (with an October-September age definition and modal grade of 11)
are treated as separate from the excluded students from the main assessment
(with a calendar-year age definition and modal grade of 12).

PROCEDURES TO DERIVE STUDENT SAMPLE WEIGHTS 

The weight assigned to a particular student reflects two major 
components of the sample design and the population being surveyed. The first 
component, the student's base weight, reflects the probability of selection of 
the student for participation in a particular type of assessment session 
(i.e., a particular bridge assessment session or for the main assessment). As
explained below, these base weights were adjusted for nonresponse, then
subjected to a trimming algorithm to reduce a few excessively large weights.
The weights were further adjusted to ensure that estimates, based on the
weights, of certain subpopulation totals correspond to values reliably
estimated from external sources (i.e., Census and Current Population Survey).
This latter form of adjustment, known as poststratification, reduces sampling
variability and may also reduce the bias resulting from noncoverage and 
nonresponse. 

Apart from changes in the poststratification procedure, detailed below,
the weighting procedures used for the 1988 bridges were essentially the same 
as those used in 1986 and 1984. 

As mentioned above, the base weight assigned to a student is the 
reciprocal of the probability that the student was invited to a particular 
type of assessment session. The base weight for a selected student was
adjusted by three nonresponse factors: one to adjust for noncooperating
schools, the second, used only in the case of bridge samples, to adjust for
allocated sessions that were not conducted, and the third to adjust for
students who were (or should have been) invited to the assessment but did not
appear either in the scheduled session or a makeup session. For spiral
sessions, the student nonresponse adjustment was made separately for two
classes of students in a PSU by age class: those in or above the modal grade
for their age and those below. This differentiation acknowledges likely and
observed differences between students in the two classes both in their
assessed abilities and in their likelihood of nonresponse. For some sessions
in some PSUs, these two classes were combined, since one or both was too small 
to form the basis for an adjustment factor. The student nonresponse 
adjustment for students sampled for tape sessions was similar except that, to 
achieve comparability with the prior assessments, the adjustment was computed 
within a PSU for each tape booklet across all students originally selected for 
that booklet. 
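
The way a base weight and the three nonresponse factors combine can be
sketched as follows. The multiplicative form follows the description above;
the ratio form of each factor (weighted eligibles divided by weighted
respondents within an adjustment class) is a common convention and is an
assumption here, as are all function names and numeric values.

```python
def nonresponse_factor(weights_eligible, weights_responding):
    """Ratio adjustment: weighted eligibles over weighted respondents.

    Assumed form; the appendix does not give the exact formula.
    """
    return sum(weights_eligible) / sum(weights_responding)

def adjusted_weight(p_selection, school_factor, session_factor, student_factor):
    """Base weight (reciprocal of selection probability) times three factors."""
    base_weight = 1.0 / p_selection
    return base_weight * school_factor * session_factor * student_factor

# Hypothetical adjustment class: 10 invited students, 8 of whom were assessed.
student_factor = nonresponse_factor([500.0] * 10, [500.0] * 8)
weight = adjusted_weight(
    p_selection=0.002,      # hypothetical probability of selection
    school_factor=1.0,      # no school nonresponse in this example
    session_factor=1.0,     # all allocated sessions conducted
    student_factor=student_factor,
)
print(student_factor, weight)
```

In this sketch each responding student carries the weight of the
nonrespondents in the same class, so the class total is preserved.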

A few students were assigned extremely large weights. One cause of 
large weights was underestimation of the number of eligible students in some 
schools leading to inappropriately low probabilities of selection for those 
schools. Other extremely large weights arose as the result of relatively high 
levels of nonresponse coupled with low-to-moderate probabilities of selection.
Students with extremely large weights can have an unusually large impact on 
estimates such as weighted means. Since the variability in weights 
contributes to the variance of an overall estimate, a few extremely large 
weights are likely to produce large sampling variances of the statistics of 
interest. In such cases, a procedure of trimming the more extreme weights to 
values somewhat closer to the mean weight was applied in order to reduce the 
mean squared errors of the estimates.
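
The effect of trimming can be illustrated with a small sketch. The trimming
rule shown (capping each weight at a fixed multiple of the mean weight) and
the multiple of 3 are assumptions for illustration only, since the actual
trimming algorithm is not specified in this appendix. The second function
computes Kish's approximate design effect from unequal weighting,
n * sum(w^2) / (sum(w))^2, as a rough gauge of the variance inflation that
extreme weights cause.

```python
def trim_weights(weights, cap_multiple=3.0):
    """Cap each weight at cap_multiple times the mean weight (assumed rule)."""
    mean_w = sum(weights) / len(weights)
    cap = cap_multiple * mean_w
    return [min(w, cap) for w in weights]

def kish_design_effect(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

# Hypothetical weights: nine typical students and one extreme weight.
weights = [100.0] * 9 + [2000.0]
trimmed = trim_weights(weights)
print(max(trimmed))
print(kish_design_effect(weights), kish_design_effect(trimmed))
```

Trimming introduces some bias in exchange for lower variance, which is why
the text frames its purpose in terms of mean squared error rather than
variance alone.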






POSTSTRATIFICATION 

As in most sample surveys, the weight assigned to a respondent is a
random variable that is subject to sampling variability. If there were no
nonresponse, the respondent weights would provide unbiased estimates of the
various subgroup proportions. However, since unbiasedness refers to average
performance over all possible replications of the sampling, it is unlikely
that any given estimate, based on the sample actually obtained, will exactly
equal the population value. Furthermore, the respondent weights have been
adjusted for nonresponse and a number of extreme weights have been reduced in
size.

To reduce the mean squared error of estimates, the sampling weights were
further adjusted so that estimated population totals for a number of specified
subgroups of the population, based on the sum of weights of students of the
specified type, were the same as presumably better estimates derived from
other sources. This adjustment, called poststratification, reduces the mean
squared error of estimates relating to student populations that span several
subgroups of the population.

The poststratification procedures used for the 1988 NAEP data differ
from those used for the 1984 and the 1986 assessments. To make the
differences clear, the 1986 and 1984 procedures will be explained.

1986 and 1984 Poststratification Procedures

The same poststratification procedures were used for both the 1984 and
1986 assessments. For the spiral assessments, 13 subgroups were defined in
terms of race, ethnicity, census region and community size (SDOC) as shown in
Table B.2. Each of the 13 subgroups was further divided into three classes:

(a) students eligible for inclusion in the sample by both 
age and grade; 

(b) students eligible for inclusion by age only; 

(c) students eligible for inclusion by grade only. 



Table B.2

Major Subgroups for Poststratification in 1986 and 1984

Subgroup   Race    Ethnicity      Region            SDOC*

    1      White   Non-Hispanic   NE                1, 2
    2      White   Non-Hispanic   NE                3, 4, 5
    3      White   Non-Hispanic   SE, Central       1, 2
    4      White   Non-Hispanic   SE, Central       3
    5      White   Non-Hispanic   SE, Central       4, 5
    6      White   Non-Hispanic   West              1, 2
    7      White   Non-Hispanic   West              3, 4, 5
    8      Any     Hispanic       NE, SE, Central   Any
    9      Any     Hispanic       West              Any
   10      Black   Non-Hispanic   NE                Any
   11      Black   Non-Hispanic   SE                Any
   12      Black   Non-Hispanic   Central, West     Any
   13      Other   Non-Hispanic   Any               Any

*SDOC (Sample Description of Community) categories: 1 = Big City; 2 = Fringe of
Big City; 3 = Medium City; 4 = Small Place; and 5 = Extreme Rural.



This resulted in 39 poststratification cells for each age class. The
final weight for a student was the product of the base weight (after adjusting
for nonresponse and after trimming to reduce the size of certain extremely
large weights) and a poststratification factor whose denominator was the sum
of those weights for the cell to which the student belongs and whose numerator
was an adjusted estimate of the total number of students in the cell. This
adjusted estimate was a composite of estimates from the NAEP sample and
independent estimates derived from Current Population Survey
estimates and Census projections. The adjusted estimate was a weighted mean
of the various estimates, the weights being inversely proportional to the
approximate variances of the NAEP and independent estimates.
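
The composite cell total and the resulting poststratification factor can be
sketched as follows. The sketch assumes a composite of exactly two estimates
(one NAEP-based, one independent) combined with weights inversely
proportional to their variances, as the paragraph above describes; the
function names and all numeric values are hypothetical.

```python
def composite_total(naep_est, naep_var, indep_est, indep_var):
    """Inverse-variance weighted mean of two estimates of a cell total."""
    w_naep = 1.0 / naep_var
    w_indep = 1.0 / indep_var
    return (w_naep * naep_est + w_indep * indep_est) / (w_naep + w_indep)

def poststratify(cell_weights, adjusted_total):
    """Scale the weights in one cell so that they sum to adjusted_total."""
    factor = adjusted_total / sum(cell_weights)
    return [w * factor for w in cell_weights]

# Hypothetical cell: adjusted, trimmed weights of three respondents.
cell_weights = [400.0, 350.0, 250.0]
adjusted_total = composite_total(
    naep_est=sum(cell_weights), naep_var=400.0,   # NAEP-based estimate
    indep_est=1100.0, indep_var=100.0,            # independent estimate
)
final_weights = poststratify(cell_weights, adjusted_total)
print(adjusted_total, final_weights)
```

The more precise of the two estimates dominates the composite, so in this
example the adjusted total lies closer to the independent estimate.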

The sample of students in each of the paced- tape administered 
assessments was much smaller than the sample for the spiral assessments. 
Consequently, some subgroups in Table B.2 were collapsed for 
poststratification as follows: 

    1, 2        6, 7

    3           8, 9

    4           10, 11, 12

    5           13

Furthermore, to achieve comparability with earlier assessments, there was no 
subdivision into eligibility classes (of students eligible by age, grade, or 
both), so there were eight poststratification cells for each age class. 
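
The collapsing of the 13 subgroups of Table B.2 into eight cells for the
paced-tape samples can be written as a simple lookup. The grouping is taken
from the list above; the numbering of the collapsed cells from 1 to 8 and the
variable names are assumptions introduced for illustration.

```python
# Each tuple lists the Table B.2 subgroups merged into one collapsed cell.
COLLAPSED_CELLS = [
    (1, 2), (3,), (4,), (5,),
    (6, 7), (8, 9), (10, 11, 12), (13,),
]

# Map each original subgroup to its (hypothetically numbered) collapsed cell.
SUBGROUP_TO_CELL = {
    subgroup: cell_number
    for cell_number, cell in enumerate(COLLAPSED_CELLS, start=1)
    for subgroup in cell
}

def cell_weight_totals(records):
    """Sum weights by collapsed cell; records are (subgroup, weight) pairs."""
    totals = {}
    for subgroup, weight in records:
        cell = SUBGROUP_TO_CELL[subgroup]
        totals[cell] = totals.get(cell, 0.0) + weight
    return totals

# Hypothetical respondents: (Table B.2 subgroup, weight).
print(cell_weight_totals([(10, 300.0), (12, 200.0), (3, 150.0)]))
```

Summing weights within the collapsed cells gives the denominators of the
poststratification factors for the tape samples.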

1988 Poststratification Procedures 

The poststratification in 1988 was done for each age/grade and 
separately for each of the spiral assessments and each of the tape 
assessments. Within each age/grade and assessment-type group,
poststratification adjustment cells were defined in terms of race, ethnicity, 
and NAEP region as shown in Table B.3. 






Table B.3

Major Subgroups for Poststratification in 1988

Subgroup   Race    Ethnicity      Region

    1      White   Non-Hispanic   NE
    2      White   Non-Hispanic   SE
    3      White   Non-Hispanic   Central
    4      White   Non-Hispanic   West
    5      Any     Hispanic       Any
    6      Black   Non-Hispanic   Any
    7      Other   Non-Hispanic   Any



This grouping resulted in seven cells for each tape session. For the 
spiral samples, each of the seven subgroups was further divided into the three 
eligibility classes: 

(a) students eligible by both age and grade; 

(b) students eligible by age only; 

(c) students eligible by grade only. 

In brief, the new poststratification procedures differ from those used 
for the 1984 and the 1986 assessments in three ways: 

1) The 1988 poststrata totals incorporate current Census Bureau
monthly population estimates by single years of age by
race/ethnicity groups. Such monthly estimates were not available
at the time of the poststratification of the 1984 and 1986
weights. The use in 1988 of estimates of in-school eligibles
based on data relating only to the particular grade and age in
question eliminated the need to derive year-to-year retention
factors for age 17 students and the need to incorporate
projections from younger ages and lower grades, as was done in
1984 and 1986.






2) The number of cells used in poststratification was reduced from
the 39 cells used in 1986 and 1984 to the 21 cells used in 1988.
The 21 poststrata used for 1988 vary substantially in mean
performance level and yet are large enough to produce reasonably
stable poststratification factors. The reduction in the number of
cells from 39 to 21 was made to increase the stability of the 
poststratification factors in an effort to reduce the sampling 
variance. 

3) The 1988 poststrata totals were derived solely from CPS data and 
Census Bureau population projections and, in contrast to the 
method used in previous years, did not use any data from the 1988 
NAEP samples. 

The new procedure was adopted in order to speed up the production of the 
weights, since poststrata totals based only on CPS and Census data can be
derived well in advance of the weighting of the data.

It is clearly important to ascertain the impact of these changes in 
poststratification on the estimates of subgroup proficiencies. In particular, 
it is important to establish that the measurement of trend in subgroup 
proficiencies is affected in a minimal way by this revision in procedures. 
The approach used to ascertain the effect of the change in poststratification 
procedures was to reweight the 1986 samples according to the new procedures 
and then compare the results with the previous results. (This approach is 
considerably more cost- and time-efficient than the alternative approach of 
reweighting the 1988 data according to the 1986 procedures.) 







Tables B.4, B.5, and B.6 show the result when the age eligible students 
in the trend samples of the 1986 assessment of reading are reweighted using
the new poststratification factors. The first two columns in each table 
compare the new procedure with the old in terms of the estimated relative 
frequencies by race/ethnicity, region, parental education, and grade. The 
last two columns compare the two procedures in terms of the mean reading 
proficiencies for those subgroups. (It should be noted that the standard 
errors of the proficiency estimates do not include the component due to the 
variability of the linear equating function- -see Appendix E for a discussion.) 

An examination of these tables shows that the effect of changing the 
poststratification procedure on mean proficiency estimates is slight: in most 
cases, the difference between the proficiency estimates based on the two 
procedures is less than one standard error (of the mean proficiency based on 
the old method) and in every case the difference is less than 1.25 standard 
errors. Since these standard errors do not include the variability due to 
equating and are, consequently, underestimates of the true standard errors of 
the mean proficiencies, the differences between estimates based on the two 
poststratification methods are well within the fluctuations to be expected by 
chance in either of the individual estimates. 
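
The criterion used in this paragraph can be made concrete by expressing each difference in units of the old-method standard error. The sketch below does this for four rows of Table B.4 (age 9); the values are copied from that table, while the dictionary layout and variable names are illustrative only.

```python
# (new mean, old mean, standard error of the old mean), from Table B.4, age 9.
rows = {
    "White": (214.7, 214.9, 1.3),
    "Black": (186.4, 185.0, 1.6),
    "Hispanic": (189.0, 189.8, 3.3),
    "Southeast": (205.2, 202.5, 2.7),
}

# Absolute difference between procedures, in units of the old standard error.
ratios = {name: abs(new - old) / se_old for name, (new, old, se_old) in rows.items()}

for name, ratio in ratios.items():
    print(f"{name:10s} {ratio:.2f}")
```

All four ratios fall below the 1.25 bound quoted in the text, and three of the four are below one standard error.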

We note that the standard errors of the difference between the original 
and revised estimates are likely to be relatively small, due to the high 
degree of correlation between the two sets of estimates. However, the 
important aspects of the change in the method are the sizes of the resulting 
differences in estimates, relative to the precision of the estimates 
themselves, as discussed above. 
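
The remark about correlation follows from the standard formula for the variance of a difference, Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b). The sketch below evaluates it for the two age 9 totals in Table B.4 (standard errors 1.3 and 1.2). The correlation of 0.95 is a purely hypothetical value chosen for illustration; the report does not give one.

```python
import math


def se_of_difference(se_a, se_b, rho):
    """Standard error of (a - b) when a and b have correlation rho."""
    return math.sqrt(se_a ** 2 + se_b ** 2 - 2.0 * rho * se_a * se_b)


se_new, se_old = 1.3, 1.2   # standard errors of the age 9 totals, Table B.4
rho_assumed = 0.95          # hypothetical correlation, not from the report

independent = se_of_difference(se_new, se_old, 0.0)
correlated = se_of_difference(se_new, se_old, rho_assumed)
print(round(independent, 3), round(correlated, 3))
```

Under this assumed correlation the standard error of the difference is far smaller than either individual standard error, which is why the text judges the differences against the precision of the estimates themselves.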






Table B.4 

Effect of Change in Poststratification Procedures: 
Relative Frequencies and Mean Reading Proficiencies, Age 9 

                           Relative Frequencies        Mean Reading Proficiencies 
                           New           Old           New            Old 
                           Procedure     Procedure     Procedure      Procedure 

Observed Race/Ethnicity 
  White                    76.0%(1.0)    76.5%(1.1)    214.7(1.5)     214.9( 1.3) 
  Black                    15.5%(0.5)    14.9%(0.5)    186.4(1.6)     185.0( 1.6) 
  Hispanic                  6.0%(1.1)     6.2%(1.1)    189.0(2.9)     189.8( 3.3) 
  Other                     2.4%(0.5)     2.5%(0.5)    204.7(6.2)!    203.7( 6.6)! 

Region 
  Northeast                20.7%(1.1)    21.1%(1.1)    212.0(3.0)     212.3( 2.7) 
  Southeast                25.9%(2.0)    22.5%(4.7)    205.2(3.2)     202.5( 2.7)! 
  Central                  26.2%(0.9)    28.6%(4.0)    211.7(2.5)     212.9( 2.7) 
  West                     27.2%(1.6)    27.7%(1.6)    206.0(3.1)     206.5( 3.0) 

Grade 
  < Modal Grade            34.2%(1.7)    33.9%(1.7)    188.3(1.2)     189.4( 1.4) 
  at Modal Grade           65.5%(1.7)    65.8%(1.7)    218.9(1.3)     218.5( 1.2) 
  > Modal Grade             0.3%(0.1)     0.3%(0.1)    238.2(8.8)!    241.9(11.3)! 

Parental Education 
  Not Graduated H S         4.3%(0.4)     4.2%(0.4)    190.1(2.9)     189.5( 2.8) 
  Graduated H S            16.0%(0.8)    16.4%(0.7)    201.5(1.4)     202.2( 1.9) 
  Post H S                 44.7%(1.2)    44.4%(1.2)    219.2(1.4)     219.0( 1.3) 

Total                                                  208.5(1.3)     208.6( 1.2) 

Note: Standard errors in parentheses (standard errors do not include equating 
error). 

! Interpret with caution: the sampling error cannot be accurately estimated, 
since the coefficient of variation of the estimated total number of students in the 
subpopulation exceeds 20 percent. 






Table B.5 

Effect of Change in Poststratification Procedures: 
Relative Frequencies and Mean Reading Proficiencies, Age 13 

                           Relative Frequencies        Mean Reading Proficiencies 
                           New           Old           New            Old 
                           Procedure     Procedure     Procedure      Procedure 

Observed Race/Ethnicity 
  White                    77.3%(0.9)    76.8%(1.0)    260.3(0.9)     258.8(1.2) 
  Black                    14.4%(0.8)    14.4%(0.9)    239.2(1.9)     239.3(1.6) 
  Hispanic                  6.1%(1.0)     6.6%(1.1)    242.1(2.6)     242.2(3.1) 
  Other                     2.2%(0.3)     2.2%(0.3)    262.3(3.6)     263.9(4.1) 

Region 
  Northeast                23.9%(1.6)    22.4%(1.6)    259.6(2.2)     258.7(2.1) 
  Southeast                23.9%(1.9)    24.7%(5.8)    254.3(1.6)     254.8(1.6)! 
  Central                  25.6%(0.6)    24.9%(5.0)    254.6(1.3)     250.8(3.6) 
  West                     26.7%(1.4)    28.0%(1.5)    256.1(1.8)     256.0(1.7) 

Grade 
  < Modal Grade            32.3%(1.6)    32.7%(2.1)    239.3(1.4)     238.4(1.4) 
  at Modal Grade           67.3%(1.6)    66.8%(2.1)    264.1(1.0)     263.0(0.9) 
  > Modal Grade             0.5%(0.1)     0.5%(0.1)    279.5(6.5)!    275.8(6.0)! 

Parental Education 
  Not Graduated H S         7.3%(0.5)     7.8%(1.0)    245.4(2.2)     244.2(2.9) 
  Graduated H S            29.6%(1.3)    30.5%(1.2)    249.8(1.2)     249.3(1.1) 
  Post H S                 54.0%(2.0)    52.3%(2.1)    263.7(1.0)     262.7(0.9) 

Total (only one value legible in source)               255.0(1.0) 

Note: Standard errors in parentheses (standard errors do not include equating 
error). 

! Interpret with caution: the sampling error cannot be accurately estimated, 
since the coefficient of variation of the estimated total number of students in the 
subpopulation exceeds 20 percent. 






Table B.6 

Effect of Change in Poststratification Procedures: 
Relative Frequencies and Mean Reading Proficiencies, Age 17 

                           Relative Frequencies        Mean Reading Proficiencies 
                           New           Old           New            Old 
                           Procedure     Procedure     Procedure      Procedure 

Observed Race/Ethnicity 
  White                    76.6%(0.4)    78.0%(0.4)    290.9(0.9)     291.4(0.9) 
  Black                    14.6%(0.2)    13.5%(0.2)    264.9(1.3)     265.0(1.2) 
  Hispanic                  6.4%(0.2)     6.2%(0.2)    266.3(2.4)     267.5(2.1) 
  Other                     2.4%(0.3)     2.4%(0.3)    274.1(4.1)     276.0(4.4) 

Region 
  Northeast                25.4%(1.2)    23.8%(0.3)    291.2(2.0)     293.1(2.0) 
  Southeast                24.0%(0.6)    21.2%(1.4)    280.0(1.0)     279.4(1.0) 
  Central                  26.1%(0.6)    28.4%(1.5)    287.1(2.1)     288.1(2.1) 
  West                     24.5%(0.9)    26.5%(0.5)    281.7(1.4)     282.7(1.5) 

Grade 
  < Modal Grade            24.9%(0.6)    21.8%(0.6)    258.0(0.9)     257.7(1.0) 
  at Modal Grade           65.8%(0.4)    70.3%(0.4)    293.1(0.8)     293.1(0.8) 
  > Modal Grade             9.3%(0.6)     7.9%(0.5)    301.2(2.0)     301.0(2.1) 

Parental Education 
  Not Graduated H S         9.3%(0.5)     8.9%(0.6)    265.0(1.1)     266.3(1.4) 
  Graduated H S            27.8%(0.9)    27.7%(0.8)    274.9(0.8)     275.9(0.8) 
  Post H S                 58.9%(1.3)    59.4%(1.2)    295.3(0.8)     295.8(0.9) 

Total                                                  285.1(0.8)     286.0(0.9) 

Note: Standard errors in parentheses (standard errors do not include equating 
error). 






APPENDIX C 

Revision of Poststratification Weights for 
Age 9/Grade 4 and Age 13/Grade 8, 1984 NAEP 






Appendix C 

REVISION OF POSTSTRATIFICATION WEIGHTS FOR 
AGE 9/GRADE 4 AND AGE 13/GRADE 8, 1984 NAEP 



Keith F. Rust 



A comparison of the proportions of 9-year-old students who were in grade 
4, based on weighted data, revealed an inconsistency between the 1984 main 
sample results and those for bridge studies in subsequent years. In 1984, the 
percentage of 9-year-old students in grade 4 was 74.9. For three subsequent 
bridges, the percentage ranged from 62.6 to 66.1. 

A consideration of the method of obtaining the separate 
poststratification factors for those students both grade and age eligible, 
those eligible by age alone, and those eligible by grade alone, used in 1984 
but not for subsequent bridges, revealed the possibility of improving the 
approach used to derive the independent estimates which constitute the major 
component of the numerators of each poststratification factor. This 
improvement pertained to the poststratification procedure for age 9/grade 4 
and age 13/grade 8, but not age 17/grade 11. 

The possibility of improvement arose because the independent estimates 
were derived using Current Population Survey (CPS) data on the distribution 
over grades of the population by whole years of age. These ages are as of 
early October, the time each year the CPS survey in which this information is 
collected is conducted. The age definition for ages 9 and 13 used in 1984 
means that this distribution is required as of January 1. (For age 17, and 
for all three ages for the main samples in 1986, the appropriate date is 
October 1, consistent with the CPS data.) 

Evidence from the 1984 and 1988 NAEP samples shows clearly that the 
proportion of 9-year-olds who were in grade 4 and 13-year-olds who were in 
grade 8 declined between October 1 and the following January 1. That is, 
there were more fourth graders who had their tenth birthday during this period 
than there were fourth graders who had their ninth birthday. The difference 
was sufficiently great as to decrease the percentage of 9-year-olds who were 
age-eligible by about 10 percentage points. A similar but less marked 
decrease also occurred at age 13. 

Independent estimates and the resulting poststratification factors were 
recomputed in a way that recognized this shift. The magnitude of the shift 
was estimated from NAEP data, this being the only source of information 
available. We note that the shift proved very consistent between the 1984 and 
1988 samples, when the same age and grade definitions were used. 
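
The direction and rough size of the correction can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that the January 1 percentage is obtained by subtracting the estimated shift from the October-based percentage; the actual recomputation of the independent estimates is not spelled out in this appendix. The two input figures are taken from the text above, and the function name is hypothetical.

```python
def january_share(october_share, shift):
    """Percentage in the modal grade as of January 1, given the October-based
    percentage and the estimated October-to-January decline (both in points)."""
    return october_share - shift


october_share_age9 = 74.9   # 1984 weighted percentage of 9-year-olds in grade 4
shift_age9 = 10.0           # "about 10 percentage points" (approximate)

adjusted_age9 = january_share(october_share_age9, shift_age9)
print(round(adjusted_age9, 1))
```

The result falls inside the 62.6 to 66.1 range reported for the three subsequent bridge samples, which is consistent with the explanation offered for the discrepancy.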

The 1988 poststratification procedure, which differed from that used in 
1984 and 1986 in a number of ways, was performed in a manner that also 
accounted for this shift in the age/grade distribution. Hence, no revision of 
the 1988 poststratification factors is required. 






APPENDIX D 

Tables of Conditioning Effects and IRT Parameters 
for Reading, Mathematics, and Science Items 






Appendix D 

TABLES OF CONDITIONING EFFECTS AND IRT PARAMETERS 
FOR READING, MATHEMATICS, AND SCIENCE ITEMS 



Table D.1 

Conditioning Effects for 1988 Reading Bridge Samples 

                          Age 9                     Age 13                    Age 17 
                          1988 Bridge               1988 Bridge               1988 Bridge 
Conditioning Variable     to 1984      to 1986      to 1984      to 1986      to 1984      to 1986 

 1. OVERALL              -1.184954    -1.202769     0.009881    -0.001284     0.548721     0.451626 
 2. GENDER (F)            0.165308     0.126011     0.213211     0.195182     0.146165     0.288233 
 3. ETHN-BLACK           -0.438853    -0.429080    -0.259103    -0.326^2     -0.272405    -0.386952 
 4. ETHN-HISP.           -0.412559    -0.485930    -0.274732    -0.465548    -0.345180    -0.344322 
 5. ETHN-ASIAN            0.357416     0.214722     0.305359     0.139659    -0.060925    -0.121725 
 6. HIGH METRO           -0.148497    -0.310135    -0.183941    -0.142262    -0.068281    -0.164850 
 7. OTHER METRO           0.147669     0.092584     0.113991     0.112262     0.152745     0.087913 
 8. SOUTHEAST            -0.103437    -0.132848     0.015798    -0.036619    -0.055429    -0.022728 
 9. CENTRAL              -0.026666    -0.093962    -0.009762    -0.061835    -0.023580    -0.041082 
10. WEST                 -0.154755    -0.085688    -0.037010    -0.128724    -0.078623    -0.112447 
11. PAR ED1(HG)           0.291950     0.284291     0.078765     0.147679     0.273536     0.134028 
12. PAR ED2(PH)           0.370200     0.477401     0.329406     0.405494     0.558189     0.443789 
13. PAR ED3(COL)          0.464392     0.475387     0.324801     0.396613     0.547234     0.594699 
14. PAR ED4(MIS)          0.163113     0.134590    -0.071777     0.060794    -0.106648    -0.232533 
15. TV                    0.235433     0.236459    -0.027696     0.086582    -0.066412    -0.014609 
16. TV**2                -0.038932    -0.036129    -0.001551    -0.018358     0.000159    -0.013616 
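
One way to read Table D.1 is as the coefficients of a linear model on the proficiency scale, so that a student's conditional mean is the sum of the effects for the indicators that apply. The sketch below assumes that additive, dummy-coded reading, which is not stated in this appendix, and uses a subset of the age 9 "1988 Bridge to 1984" column. The function and dictionary names are hypothetical.

```python
# Subset of the age 9, "1988 Bridge to 1984" column of Table D.1.
EFFECTS = {
    "OVERALL": -1.184954,
    "GENDER (F)": 0.165308,
    "ETHN-BLACK": -0.438853,
    "SOUTHEAST": -0.103437,
    "PAR ED3(COL)": 0.464392,
}


def linear_predictor(indicators, effects=EFFECTS):
    """Sum of effect * indicator over the variables that describe one student."""
    return sum(effects[name] * value for name, value in indicators.items())


# Hypothetical student: female, Southeast region, all other indicators zero.
student = {"OVERALL": 1, "GENDER (F)": 1, "SOUTHEAST": 1}
predicted = linear_predictor(student)
print(round(predicted, 6))
```

Because the TV and parental education terms are left out, the value is only an illustration of how the effects combine, not a usable prediction.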






Table D.2 

1986 Adjusted Reading Item Parameters, Age 9 

Item         - A -      - B -      - C - 

N001501      2.6295    -0.9528     0.3161 
N001502      2.2625    -0.4435     0.1829 
N001503      1.8960    -0.7052     0.2737 
N001504      2.0545    -0.4687     0.2567 
N002001      1.3980    -0.0249     0.1567 
N002002      1.6070    -0.1793     0.2090 
N002003      1.7060    -0.2887     0.2377 
N002801      2.3385    -0.8188     0.1752 
N002802      2.1880    -0.9522     0.1719 
N003101      1.4020    -0.6171     0.2590 
N003102      1.9155    -0.3275     0.2193 
N003104      0.9560     2.0387     0.0000 
N004101      0.9070    -1.1044     0.1996 
N008601      1.9990    -0.9874     0.1979 
N008602      1.7485    -0.7121     0.2457 
N008603      1.7075    -0.9390     0.2074 
N003901      1.4920    -0.9869     0.2490 
N008902      1.5150    -1.1042     0.2452 
N009401      1.4945    -1.5595     0.1307 
N009801      2.1765    -2.1176     0.2378 
N010201      1.4325    -1.7527     0.2057 
N010301      0.8620    -2.0273     0.2212 
N010401      0.7640    -1.1312     0.2283 
N010402      1.2130     0.0162     0.2521 
N010403      1.2350     0.4135     0.1829 
N010501      2.7340    -1.2871     0.3298 
N010502      1.3120    -1.0646     0.2831 
N010503      1.9330    -1.2786     0.3015 
N010504      2.7855    -1.0061     0.2048 
N013301      1.8695    -1.8405     0.1849 
N014201      1.4615    -0.8734     0.2006 
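
The columns A, B, and C can be read as the discrimination, difficulty, and lower-asymptote parameters of a three-parameter logistic item response function. The sketch below evaluates that function for the first item of Table D.2. The scaling constant 1.7 and the function name are assumptions introduced for illustration and are not stated in this appendix.

```python
import math

D = 1.7  # assumed scaling constant, customary for the logistic approximation


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response at theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Parameters of the first item in Table D.2 (age 9).
a, b, c = 2.6295, -0.9528, 0.3161

p_at_b = p_correct(b, a, b, c)       # at theta = b the curve is halfway between c and 1
p_low = p_correct(-10.0, a, b, c)    # far below b the curve approaches c
print(round(p_at_b, 5), round(p_low, 5))
```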






Table D.3 

1986 Adjusted Reading Item Parameters, Age 13 

Item         - A -      - B -      - C - 

N001501      2.3220    -1.0862     0.3161 
N001502      1.9975    -0.5095     0.1829 
N001503      1.6740    -0.8058     0.2737 
N001504      1.8140    -0.5379     0.2567 
N002001      1.2345    -0.0354     0.1567 
N002002      1.4190    -0.2103     0.2090 
N002003      1.5065    -0.3341     0.2377 
N002801      2.0650    -0.9345     0.1752 
N002802      1.9320    -1.0855     0.1719 
N003001      1.1580     1.3614     0.1867 
N003003      2.3540     1.3540     0.1131 
N003101      1.2380    -0.7061     0.2590 
N003102      1.6910    -0.3781     0.2193 
N003104      0.8445     2.3017     0.0000 
N004601      0.8275    -0.0688     0.1932 
N004602      1.3360    -0.2043     0.2641 
N004603      1.3740    -0.6796     0.2651 
N005001      2.2580     1.1207     0.2340 
N005002      1.1535     1.2448     0.3678 
N005003      1.0380     1.5894     0.1366 
N008201      3.7805    -0.4031     0.2773 
N008202      1.1140    -0.1301     0.2268 
N008203      1.6225    -0.3674     0.3021 
N008204      2.5680    -0.2234     0.1897 
N008205      3.0465    -0.1629     0.2711 
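
For items that appear at both ages, the tabled values look consistent with a single linear rescaling of the proficiency metric: the C values are identical, the A values differ by a near-constant ratio, and the B values follow the matching linear map. The sketch below checks this for four items common to Tables D.2 and D.3. This is an observation about the printed numbers, offered under that assumption, and not a description of the authors' adjustment procedure.

```python
# (A, B) for four items common to Table D.2 (age 9) and Table D.3 (age 13).
AGE9 = {
    "N001501": (2.6295, -0.9528),
    "N001502": (2.2625, -0.4435),
    "N002001": (1.3980, -0.0249),
    "N003104": (0.9560, 2.0387),
}
AGE13 = {
    "N001501": (2.3220, -1.0862),
    "N001502": (1.9975, -0.5095),
    "N002001": (1.2345, -0.0354),
    "N003104": (0.8445, 2.3017),
}

# If theta13 = theta9 / k + s, then A13 = k * A9 and B13 = B9 / k + s.
ratios = [AGE13[item][0] / AGE9[item][0] for item in AGE9]
k = sum(ratios) / len(ratios)
shifts = [AGE13[item][1] - AGE9[item][1] / k for item in AGE9]

print(round(k, 4), [round(s, 4) for s in shifts])
```

The ratio is close to 0.883 for every item and the implied shifts agree to within rounding of the four-decimal table entries.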






Table D.4 



1986 Adjusted Reading Item Parameters, Age 17 











B - 




- c - 






.5/45 


-0 


.6543 


0 


.3161 






. 2x50 


-0 


.1341 


0 


.1829 




1 


. o5oU 


-0 


.4013 


0 


.2737 






. UXX5 


-0 


.1597 


0 


.2567 


nuuzuux 


1 


. ooo5 


0 


o n o c 

.2935 


0 


.1567 




1 




A 
0 


1 O CO 

. X35o 


0 


O Art A 

.2090 




1 


. b /Uj 


A 
0 


AO/. T 

.0241 


0 


.2377 




0 


. Zo7 J 


A 

-0 


.5174 


0 


.1752 




Z 


T /. on 
. X4ZU 


A 

-0 


. 6536 


0 


.1719 




1 


OQ/.n 


X 


c coo 
.5533 


0 


.1867 






. oXUU 


1 
X 


.5466 


0 


.1131 




1 


. i/iO 


-0 


.3114 


0 


.2590 




1 


. o/50 


-0 


.0155 


0 


.2193 




U 


. 9360 


2 


.4014 


0 


.0000 




1 


. 54x0 


-0 


.1243 


0 


.2674 




1 


, /4/0 


0 


.3168 


0 


.2264 




1 


. ooX5 


0 


.5189 


0 


.2064 




1 
1 


. o5o5 


0 


.3744 


0 


.2069 




u 


. yx/ J 


A 

0 


.2634 


0 


.1932 




J. . 


4oX5 


A 

0 


. X4X2 


0 


.2641 


1.1 \^ \/ ^ U v «^ 






A 

-u , 


0 Q7C 


A 

0 


O^CT 

. 2651 




0 

. 




X . 




A 

0 


O O / A 

. 2340 


N005002 


1 


97Qn 


X . 


A Afl9 


A 

u . 


3o /o 


N005003 


1 


X jXU 


X . 




A 

0 , 


T ICC 

X3oo 


N007301 


X . 


uuou 


A 

u . 


01 no 


A 
0 , 


233U 


N007'^09 


J. . 


uo / u 


A 

u. 


COQC 


0 . 


OT OT 

2181 


i mJKJmJ 


J. . 




A 

u , 


O O OQ 

2393 


0 . 


1 ^ OT 

1681 


M007^0A 


1 


XO? J 


A 
0. 


Q Q CO 

3353 


0 . 


2271 




u . 


oXX5 


A 
0. 


611/ 


0. 


O AO A 

2020 




T 

X . 


4 j/ J 


A 

0 . 


O Q A Q 

234o 


0. 


1 f t\'\ 
1601 




T 

X . 




A 

0. 


O T T A 

31/0 


0. 


IOC/ 

1854 


unn7An9 


1 

X . 


z/ yu 


A 

-0. 


O T CA 

3/54 


0. 


O AO O 

2088 


nuu / *tu J 


X . 




A 

0 . 


O OAT 

2o0/ 


0 . 


1 A C C 

1955 




X 


xuyu 


A 

u. 


23y3 


A 

0 . 




N007405 


1 

X • 


1 S7S 


X • 


9^^9 


A 
U . 




N008201      4.1920    -0.0381     0.2773 
N008202      1.2350     0.2082     0.2268 
N008203      1.7990    -0.0059     0.3021 
N008204      2.8470     0.1239     0.1897 
N008205      3.3780     0.1785     0.2711 
N013401      1.3085     0.1789     0.1540 
N013402      1.7845     0.0858     0.2624 
N013403      2.1115     0.3466     0.2168 
N021301      1.1945     0.1155     0.0000 



(continued) 






Table D.4 (continued) 

1986 Adjusted Reading Item Parameters, Age 17 

Item         - A -      - B -      - C - 

N021303      1.0995    -0.4062     0.1905 
N021304      0.4930     0.6265     0.1725 
N021305      1.0815     0.4609     0.1890 
N021201      0.9520     0.1586     0.1799 
N021202      0.6940     0.1285     0.1946 
N021203      0.7785     0.3696     0.2039 
N021204      0.8030     0.0307     0.1901 
N021601      0.6850     0.0290     0.2516 
N021602      0.8675     1.0329     0.1544 
N021603      0.4065     1.3870     0.2163 
N021604      1.5265     0.5088     0.1589 
N021605      0.8780     1.2430     0.3888 
N021701      1.2380    -0.1516     0.2287 
N021702      1.0115     1.1863     0.1079 
N021703      1.4940     1.3292     0.2894 
N021801      1.3600     0.2600     0.0000 
N021803      1.3090     0.5337     0.2914 
N021805      1.0665     0.0178     0.0000 






Table D.5 

Conditioning Effects for 1986 Reading 
with Adjusted Item Parameters, Age 9 

                 Estimated 
Variable         Effect        Description 

 1 OVERALL      -0.449782       1 OVERALL CONSTANT '1' FOR EVERYONE 
 2 GENDER2       0.148332       2 SEX (FEMALE) 
 3 ETHNIC2      -0.057906       3 ETHNICITY (BLACK) 
 4 ETHNIC3      -0.224260       4 ETHNICITY (HISPANIC) 
 5 ETHNIC4      -0.027006       5 ETHNICITY (ASIAN) 
 6 STOC2         0.092196       6 SIZE AND TYPE OF COMMUNITY (HIGH METRO) 
 7 STOC3         0.149317       7 SIZE AND TYPE OF COMMUNITY (NOT HIGH OR LOW) 
 8 REGION2      -0.027025       8 REGION (SOUTHEAST) 
 9 REGION3       0.037337       9 REGION (CENTRAL) 
10 REGION4       0.030380      10 REGION (WEST) 
11 PARED2        0.058072      11 PARENTS EDUCATION (HIGH SCHOOL GRAD) 
12 PARED3        0.238289      12 PARENTS EDUCATION (POST HIGH SCHOOL) 
13 PARED4        0.210194      13 PARENTS EDUCATION (COLLEGE GRAD) 
14 PARED_        0.130707      14 PARENTS EDUCATION (MISSING, I DON'T KNOW) 
15 ITEMS2        0.020836      15 ITEMS IN HOME (FOUR OF THE FIVE) 
16 ITEMS3        0.045386      16 ITEMS IN HOME (FIVE OF THE FIVE) 
17 TV            0.077068      17 HOURS TV WATCHING (LINEAR) 
18 TV**2        -0.015100      18 HOURS TV WATCHING (QUADRATIC) 
19 HW-YES       -0.251901      19 HOMEWORK (DON'T HAVE ANY & SOME AMOUNT) 
20 HW-2345      -0.013257      20 HOMEWORK AMOUNT (LINEAR) 
21 LM BY E3      0.044572      21 LANGUAGE MINORITY BY ETHNICITY (YES, HISPANIC) 
22 LM BY E4     -0.120663      22 LANGUAGE MINORITY BY ETHNICITY (YES, ASIAN) 
23 LM BY E_     -0.074779      23 LANGUAGE MINORITY BY ETHNICITY (YES, OTHER ETH) 
24 LUNCH%       -0.040061      24 PERCENT IN LUNCH PROGRAM (F3.2) 
25 LUNCH_       -0.046324      25 LUNCH PROGRAM (MISSING) 
26 %WHITE49     -0.148280      26 PERCENT WHITE IN SCHOOL (0-49% WHITE MINORITY) 
27 %WHITE79     -0.067982      27 PERCENT WHITE IN SCHOOL (50-79% INTEGRATED) 
28 E2 X SEX      0.134723      29 ETHNICITY BY GENDER (BLACK FEMALE) 
29 E3 X SEX      0.087811      30 ETHNICITY BY GENDER (HISPANIC FEMALE) 
30 E4 X SEX     -0.001002      31 ETHNICITY BY GENDER (ASIAN FEMALE) 
31 E2 X PE2     -0.179641      32 ETHNICITY BY PARENT'S ED (BLACK, HS GRAD) 
32 E2 X PE3     -0.224820      33 ETHNICITY BY PARENT'S ED (BLACK, POST HS) 
33 E2 X PE4     -0.098036      34 ETHNICITY BY PARENT'S ED (BLACK, COLLEGE GRAD) 
34 E2 X PE_     -0.159220      35 ETHNICITY BY PARENT'S ED (BLACK, UNKNOWN) 
35 E3 X PE2      0.030622      36 ETHNICITY BY PARENT'S ED (HISPANIC, HS GRAD) 
36 E3 X PE3     -0.170710      37 ETHNICITY BY PARENT'S ED (HISPANIC, POST HS) 
37 E3 X PE4     -0.111656      38 ETHNICITY BY PARENT'S ED (HISPANIC, COLLEGE) 
38 E3 X PE_      0.058495      39 ETHNICITY BY PARENT'S ED (HISPANIC, UNKNOWN) 
39 E4 X PE3      0.365161      41 ETHNICITY BY PARENT'S ED (ASIAN, POST HS) 
40 E4 X PE4      0.259550      42 ETHNICITY BY PARENT'S ED (ASIAN, COLLEGE GRAD) 
41 E4 X PE_      0.366435      43 ETHNICITY BY PARENT'S ED (ASIAN, UNKNOWN) 
42 <MA,<MG      -0.682847      44 MODAL AGE, LESS THAN MODAL GRADE 
43 MA,MG        -0.420010      45 MODAL AGE, MODAL GRADE, MISSING 

(continued) 






Table D.5 (continued) 

Conditioning Effects for 1986 Reading 
with Adjusted Item Parameters, Age 9 

                 Estimated 
Variable         Effect        Description 

44 SCH TYPE      0.073026      48 SCHOOL TYPE (NOT PUBLIC) 
45 ASK SW?       0.056887      49 FAMILY ASKS ABOUT SCHOOLWORK (ALMOST EVERY DAY) 
46 PRESCH1       0.071246      50 WENT TO PRESCHOOL (YES) 
47 #PARENT1      0.102494      51 SINGLE/MULTIPLE PARENT HOME (MOTHER, FATHER HOME) 
48 MOTHER        0.011320      52 MOTHER AT HOME (WORKING AND NON-WORKING) 
49 MOWORK        0.009405      53 MOTHER WORKS OUTSIDE HOME (YES) 
50 SCIEN123     -0.167822      54 TIME SPENT IN SCIENCE (AT LEAST ONCE A WEEK) 
51 SCIEN45      -0.171251      55 TIME SPENT IN SCIENCE (<ONCE A WEEK OR NEVER) 
52 COMPUTER      0.022781      56 USE COMPUTERS FOR MATH, READING, ETC. (YES) 
53 SUPERVIS      0.062169      57 ADULT SUPERVISION OF STUDENT AFTER SCHOOL (YES) 
54 MATH Q1      -0.298827      58 MATH 1ST QUANTILE (LINEAR -1,0,1) 
55 SCI Q1       -0.229154      59 SCIENCE 1ST QUANTILE (LINEAR -1,0,1) 






Table D.6 

Conditioning Effects for 1986 Reading 
with Adjusted Item Parameters, Age 13 

                 Estimated 
Variable         Effect        Description 

 1 OVERALL      -1.178359       1 OVERALL CONSTANT '1' FOR EVERYONE 
 2 GENDER2       0.150605       2 SEX (FEMALE) 
 3 ETHNIC2      -0.051714       3 ETHNICITY (BLACK) 
 4 ETHNIC3      -0.062785       4 ETHNICITY (HISPANIC) 
 5 ETHNIC4       0.306524       5 ETHNICITY (ASIAN) 
 6 STOC2         0.088873       6 SIZE AND TYPE OF COMMUNITY (HIGH METRO) 
 7 STOC3         0.022681       7 SIZE AND TYPE OF COMMUNITY (NOT HIGH OR LOW) 
 8 REGION2       0.130568       8 REGION (SOUTHEAST) 
 9 REGION3       0.016751       9 REGION (CENTRAL) 
10 REGION4       0.007241      10 REGION (WEST) 
11 PARED2       -0.085747      11 PARENTS EDUCATION (HIGH SCHOOL GRAD) 
12 PARED3       -0.034406      12 PARENTS EDUCATION (POST HIGH SCHOOL) 
13 PARED4       -0.073942      13 PARENTS EDUCATION (COLLEGE GRAD) 
14 PARED_       -0.086500      14 PARENTS EDUCATION (MISSING, I DON'T KNOW) 
15 ITEMS2        0.091700      15 ITEMS IN HOME (FOUR OF THE FIVE) 
16 ITEMS3        0.088342      16 ITEMS IN HOME (FIVE OF THE FIVE) 
17 TV            0.063395      17 HOURS TV WATCHING (LINEAR) 
18 TV**2        -0.009358      18 HOURS TV WATCHING (QUADRATIC) 
19 HW-NO         0.245607      19 HOMEWORK (DON'T HAVE ANY) 
20 HW-YES        0.189577      20 HOMEWORK (YES - SOME AMOUNT) 
21 HW-3456       0.010437      21 HOMEWORK AMOUNT (LINEAR) 
22 LM BY E3      0.102256      22 LANGUAGE MINORITY BY ETHNICITY (YES, HISPANIC) 
23 LM BY E4      0.056094      23 LANGUAGE MINORITY BY ETHNICITY (YES, ASIAN) 
24 LM BY E_     -0.061011      24 LANGUAGE MINORITY BY ETHNICITY (YES, OTHER ETH) 
25 LUNCH%        0.019951      25 PERCENT IN LUNCH PROGRAM (F3.2) 
26 LUNCH_       -0.055597      26 LUNCH PROGRAM (MISSING) 
27 %WHITE49      0.032951      27 PERCENT WHITE IN SCHOOL (0-49% WHITE MINORITY) 
28 %WHITE79      0.029780      28 PERCENT WHITE IN SCHOOL (50-79% INTEGRATED) 
29 E2 X SEX     -0.044162      30 ETHNICITY BY GENDER (BLACK FEMALE) 
30 E3 X SEX      0.016185      31 ETHNICITY BY GENDER (HISPANIC FEMALE) 
31 E4 X SEX      0.143079      32 ETHNICITY BY GENDER (ASIAN FEMALE) 
32 E2 X PE2      0.032872      33 ETHNICITY BY PARENT'S ED (BLACK, HS GRAD) 
33 E2 X PE3      0.041641      34 ETHNICITY BY PARENT'S ED (BLACK, POST HS) 
34 E2 X PE4      0.026823      35 ETHNICITY BY PARENT'S ED (BLACK, COLLEGE GRAD) 
35 E2 X PE_      0.096667      36 ETHNICITY BY PARENT'S ED (BLACK, UNKNOWN) 
36 E3 X PE2     -0.114480      37 ETHNICITY BY PARENT'S ED (HISPANIC, HS GRAD) 
37 E3 X PE3     -0.161151      38 ETHNICITY BY PARENT'S ED (HISPANIC, POST HS) 
38 E3 X PE4     -0.142465      39 ETHNICITY BY PARENT'S ED (HISPANIC, COLLEGE) 
39 E3 X PE_     -0.038051      40 ETHNICITY BY PARENT'S ED (HISPANIC, UNKNOWN) 
40 E4 X PE2     -0.618719      41 ETHNICITY BY PARENT'S ED (ASIAN, HS GRAD) 
41 E4 X PE3     -0.502270      42 ETHNICITY BY PARENT'S ED (ASIAN, POST HS) 
42 E4 X PE4     -0.417888      43 ETHNICITY BY PARENT'S ED (ASIAN, COLLEGE GRAD) 

(continued) 






Table D.6 (continued) 

Conditioning Effects for 1986 Reading 
with Adjusted Item Parameters, Age 13 

                 Estimated 
Variable         Effect        Description 

43 E4 X PE_     -0.193980      44 ETHNICITY BY PARENT'S ED (ASIAN, UNKNOWN) 
44 <MA,<MG      -0.235984      45 MODAL AGE, LESS THAN MODAL GRADE 
45 MA,MG        -0.134553      46 MODAL AGE, MODAL GRADE, MISSING 
46 SCH TYPE      0.096220      49 SCHOOL TYPE (NOT PUBLIC) 
47 ASK SW?       0.046783      50 FAMILY ASKS ABOUT SCHOOLWORK (ALMOST EVERY DAY) 
48 PRESCH1       0.067139      51 WENT TO PRESCHOOL (YES) 
49 #PARENT1     -0.028203      52 SINGLE/MULTIPLE PARENT HOME (MOTHER, FATHER HOME) 
50 MOTHER        0.070043      53 MOTHER AT HOME (WORKING AND NON-WORKING) 
51 MOWORK       -0.012856      54 MOTHER WORKS OUTSIDE HOME (YES) 
52 COMPUTER     -0.087101      55 USE COMPUTERS FOR MATH, READING, ETC. (YES) 
53 MATH2         0.232694      56 TYPE OF MATH CLASS (REGULAR MATH) 
54 MATH3         0.259156      57 TYPE OF MATH CLASS (PRE-ALGEBRA) 
55 MATH45        0.312297      58 TYPE OF MATH CLASS (ALGEBRA, OTHER) 
56 SCIENCE2      0.034047      59 STUDYING IN SCIENCE THIS YEAR (LIFE SCIENCE) 
57 SCIENCE3      0.077382      60 STUDYING IN SCIENCE THIS YEAR (PHYSICAL SCIENCE) 
58 SCIENCE4      0.092771      61 STUDYING IN SCIENCE THIS YEAR (EARTH SCIENCE) 
59 SCIENCE5      0.095069      62 STUDYING IN SCIENCE THIS YEAR (GENERAL SCIENCE) 
60 GRADES        0.163220      63 GRADES IN SCHOOL (LINEAR) 
61 MATH Q1      -0.192220      64 MATH 1ST QUANTILE (LINEAR -1,0,1) 
62 SCI Q1       -0.253671      65 SCIENCE 1ST QUANTILE (LINEAR -1,0,1) 






Table D.7 

Conditioning Effects for 1986 Reading 
with Adjusted Item Parameters, Age 17 

                 Estimated 
Variable         Effect        Description 

 1 OVERALL      -0.094618       1 OVERALL CONSTANT '1' FOR EVERYONE 
 2 GENDER2       0.183789       2 SEX (FEMALE) 
 3 ETHNIC2      -0.152881       3 ETHNICITY (BLACK) 
 4 ETHNIC3      -0.192689       4 ETHNICITY (HISPANIC) 
 5 ETHNIC4      -0.267717       5 ETHNICITY (ASIAN) 
 6 STOC2         0.128344       6 SIZE AND TYPE OF COMMUNITY (HIGH METRO) 
 7 STOC3         0.062382       7 SIZE AND TYPE OF COMMUNITY (NOT HIGH OR LOW) 
 8 REGION2      -0.013891       8 REGION (SOUTHEAST) 
 9 REGION3       0.031733       9 REGION (CENTRAL) 
10 REGION4      -0.029610      10 REGION (WEST) 
11 PARED2       -0.036836      11 PARENTS EDUCATION (HIGH SCHOOL GRAD) 
12 PARED3        0.048126      12 PARENTS EDUCATION (POST HIGH SCHOOL) 
13 PARED4        0.056826      13 PARENTS EDUCATION (COLLEGE GRAD) 
14 PARED_       -0.185737      14 PARENTS EDUCATION (MISSING, I DON'T KNOW) 
15 ITEMS2        0.086351      15 ITEMS IN HOME (FOUR OF THE FIVE) 
16 ITEMS3        0.116690      16 ITEMS IN HOME (FIVE OF THE FIVE) 
17 TV            0.018872      17 HOURS TV WATCHING (LINEAR) 
18 TV**2        -0.006411      18 HOURS TV WATCHING (QUADRATIC) (F2.0) 
19 HW-NO        -0.285550      19 HOMEWORK (DON'T HAVE ANY) 
20 HW-YES       -0.141044      20 HOMEWORK (YES - SOME AMOUNT) 
21 HW-3456      -0.000823      21 HOMEWORK AMOUNT (LINEAR) 
22 LM BY E3     -0.056669      22 LANGUAGE MINORITY BY ETHNICITY (YES, HISPANIC) 
23 LM BY E4     -0.209130      23 LANGUAGE MINORITY BY ETHNICITY (YES, ASIAN) 
24 LM BY E_      0.012988      24 LANGUAGE MINORITY BY ETHNICITY (YES, OTHER ETH) 
25 LUNCH%       -0.120100      25 PERCENT IN LUNCH PROGRAM (F3.2) 
26 LUNCH_       -0.019393      26 LUNCH PROGRAM (MISSING) 
27 %WHITE49      0.008171      27 PERCENT WHITE IN SCHOOL (0-49% WHITE MINORITY) 
28 %WHITE79      0.033535      28 PERCENT WHITE IN SCHOOL (50-79% INTEGRATED) 
29 E2 X SEX     -0.125090      30 ETHNICITY BY GENDER (BLACK FEMALE) 
30 E3 X SEX     -0.007812      31 ETHNICITY BY GENDER (HISPANIC FEMALE) 
31 E4 X SEX      0.045293      32 ETHNICITY BY GENDER (ASIAN FEMALE) 
32 E2 X PE2     -0.011828      33 ETHNICITY BY PARENT'S ED (BLACK, HS GRAD) 
33 E2 X PE3      0.063253      34 ETHNICITY BY PARENT'S ED (BLACK, POST HS) 
34 E2 X PE4     -0.036463      35 ETHNICITY BY PARENT'S ED (BLACK, COLLEGE GRAD) 
35 E2 X PE_      0.108169      36 ETHNICITY BY PARENT'S ED (BLACK, UNKNOWN) 
36 E3 X PE2      0.024990      37 ETHNICITY BY PARENT'S ED (HISPANIC, HS GRAD) 
37 E3 X PE3      0.074898      38 ETHNICITY BY PARENT'S ED (HISPANIC, POST HS) 
38 E3 X PE4      0.060779      39 ETHNICITY BY PARENT'S ED (HISPANIC, COLLEGE) 
39 E3 X PE_      0.165982      40 ETHNICITY BY PARENT'S ED (HISPANIC, UNKNOWN) 
40 E4 X PE2      0.076386      41 ETHNICITY BY PARENT'S ED (ASIAN, HS GRAD) 
41 E4 X PE3      0.251174      42 ETHNICITY BY PARENT'S ED (ASIAN, POST HS) 
42 E4 X PE4      0.181287      43 ETHNICITY BY PARENT'S ED (ASIAN, COLLEGE GRAD) 

(continued) 





Table D.7 (continued) 

Conditioning Effects for 1986 Reading 
with Adjusted Item Parameters, Age 17 

                 Estimated 
Variable         Effect        Description 

43 E4 X PE_      0.276123      44 ETHNICITY BY PARENT'S ED (ASIAN, UNKNOWN) 
44 (illegible)  -0.281517      45 
45 (illegible)  -0.053717      46 
46 (illegible)   0.001677      47 
47 (illegible)  -0.213755      48 
48 SCH TYPE      0.064388      49 SCHOOL TYPE (NOT PUBLIC) 
49 (illegible)  -0.034331      50 
50 (illegible)   0.003179      51 
51 (illegible)   0.007493      52 
52 MOTHER       -0.027155      53 
53 MOWORK       -0.002359      54 
54 GRADES        0.175612      55 
55 HS PGM2       0.100833      56 
56 HS PGM3      -0.031808      57 ... SCHOOL PROGRAM (VOCATIONAL, TECHNICAL) 
57 NO. MATH      0.061355      58 ... OF MATH COURSES 
58 NO. SCI       0.052572      59 ... OF SCIENCE COURSES 
59 POSTSEC2      0.055939      60 
60 POSTSEC3      0.147C77      61 
61 (illegible)  -0.041279      62 
62 (illegible)   0.096967      63 
63 ENGLISH5     -0.151264      64 
64 MATH Q1      -0.152759      65 ... (LINEAR -1,0,1) (F2.0) 
65 SCI Q1       -0.251787      66 ... (LINEAR -1,0,1) (F2.0) 





Table D.8 



NAEP 1988 IRT Parameters, Mathematics Trend Items, Age 9 



Field Block Item A SE B SE 



SE 





M1 


1 
1 


N977Am 


Ml 

njL 




n^o / JL 


M1 

njL 






M1 

17. JL 


A 


N276802 


Ml 

n JL 


D 


N276803 


Ml 

iiJL 


C 
0 


N250701 


Ml 

11 JL 


7 




Ml 

I^JL 


o 
o 


N250703 


Ml 

njL 


Q 


N262201 


Ml 

n JL 


1 0 


N2S72m 

/ JL 


M1 

njL 


1 1 




Ml 

llJL 


1 9 

JL^ 


N286101 


Ml 

IIJL 


1 '\ 

JL J 


N270001 


Ml 

11 JL 


1 A 
JL*f 


N272102 


Ml 

11 JL 


1 S 


N284001 

XI Am w W JL 


Ml 

11 JL 


1 

JLO 


N284002 


Ml 

11 JL 


17 

JL/ 


N267602 


Ml 
11^ 


1 R 

JLO 


N262501 

41 Aa W Aa a^ \/ JL 


Ml 

IIJL 


1 Q 

JL7 


N262502 

Aa W Aa a/ x/ A> 


Ml 

11 JL 




N265401 


Ml 


21 

Cm JL 


N266101 


Ml 


22 


N269101 


Ml 


2*^ 


N268201 


Ml 


24 

<a*T 


N252101 


Ml 


25 


N272301 


M2 




N276601 


M2 


0 


N257801 


M2 
11^ 


>J 


N263401 


M2 
11^ 


A 


N263402 


M2 


mJ 


N273501 


M2 
11^ 


/: 


N275401 


M2 


7 


N277501 


M2 


8 


N277601 


M2 


9 


N277602 


M2 


10 


N277603 


M2 


11 


N261401 


M2 


12 


N250601 


M2 


13 


N250602 


M2 


14 


N250603 


M2 


15 


N251401 


M2 


16 


N250901 


M2 


17 


N250902 


M2 


18 


N250903 


M2 


19 



(continued) 



0.894 (0.037) -2.165(0.098) 0.000 (0.000) 
1.026 (0.063) -1.573(0.114) 0.177 (0.038) 
1.268 (0.066) -0.611(0.049) 0.156 (0.020) 
0.490 (0.045) -3.763(0.353) 0.000 (0.000) 
0.725 (0.038) -1.591(0.090) 0.000 (0.000) 
0.621 (0.035) 0.147(0.027) 0.000 (0.000) 
0.743 (0.044) -0.850(0.059) 0.139 (0.022) 
1.001 (0.048) 0.841(0.054) 0.117 (0.011) 
1.054 (0.064) 0.015(0.033) 0.123 (0.016) 
0.441 (0.036) -1.218(0.105) 0.196 (0.024) 
1.233 (0.084) -0.533(0.055) 0.283 (0.020) 
0.963 (0.040) -0.758(0.042) 0.000 (0.000) 
0.814 (0.039) -0.521(0.035) 0.000 (0.000) 
0.448 (0.030) -0.727(0.053) 0.000 (0.000) 
0.992 (0.062) 0.034(0.039) 0.173 (0.018) 
0.981 (0.050) -0.383(0.033) 0.000 (0.000) 
0.792 (0.037) 2.054(0.103) 0.000 (0.000) 
1.103 (0.057) -0.074(0.031) 0.104 (0.014) 
0.269 (0.031) -0.688(0.084) 0.227 (0.019) 
0.254 (0.062) 6.169(1.519) 0.172 (0.008) 
1.582 (0.164) 2.224(0.360) 0.340 (0.011) 
0.542 (0.052) 1.917(0.192) 0.264 (0.011) 
0.540 (0.071) 2.970(0.402) 0.238 (0.009) 
1.248 (0.058) 1.026(0.068) 0.201 (0.010) 
0.839 (0.060) 1.752(0.143) 0.170 (0.012) 
0.946 (0.052) -1.947(0.123) 0.180 (0.040) 
1.061 (0.062) -1.010(0.076) 0.170 (0.029) 
0.588 (0.038) -0.909(0.066) 0.240 (0.022) 
0.888 (0.063) -0.701(0.063) 0.299 (0.022) 
1.010 (0.080) -0.203(0.043) 0.282 (0.018) 
0.744 (0.058) -0.684(0.068) 0.261 (0.026) 
0.985 (0.043) -0.478(0.033) 0.000 (0.000) 
0.842 (0.039) -0.421(0.031) 0.000 (0.000) 
1.438 (0.049) -0.522(0.037) 0.000 (0.000) 
1.267 (0.053) 0.172(0.029) 0.000 (0.000) 
1.507 (0.063) -0.011(0.030) 0.000 (0.000) 
0.509 (0.042) -0.145(0.037) 0.232 (0.020) 
1.097 (0.078) -0.231(0.045) 0.212 (0.019) 
0.791 (0.053) -0.584(0.054) 0.189 (0.023) 
1.366 (0.071) 0.566(0.056) 0.158 (0.013) 
0.654 (0.042) -0.265(0.038) 0.151 (0.021) 
0.599 (0.040) -0.411(0.040) 0.178 (0.019) 
1.101 (0.051) 1.181(0.072) 0.157 (0.010) 
0.970 (0.051) 0.685(0.050) 0.109 (0.012) 






Table D.8 (continued)
NAEP 1988 IRT Parameters, Mathematics Trend Items, Age 9

Field    Block  Item     A     SE         B     SE        C     SE

N276001  M2     21    0.879  (0.037)  -0.975  (0.049)  0.000  (0.000)
N276002  M2     22    0.778  (0.035)   1.507  (0.074)  0.000  (0.000)
N271101  M2     24    0.626  (0.034)  -0.305  (0.028)  0.000  (0.000)
N252001  M2     25    1.244  (0.131)   2.670  (0.372)  0.196  (0.009)
N269001  M2     26    0.565  (0.087)   4.055  (0.634)  0.082  (0.007)
N272801  M3     15    0.576  (0.049)  -2.007  (0.176)  0.180  (0.036)
N267001  M3     16    0.597  (0.045)  -1.392  (0.110)  0.249  (0.026)
N272101  M3     17    0.990  (0.096)  -0.533  (0.071)  0.286  (0.024)
N262401  M3     18    0.594  (0.069)   0.928  (0.116)  0.300  (0.013)
N258501  M3     19    0.876  (0.066)   1.029  (0.092)  0.236  (0.012)
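
The A, B, and C columns in these item tables read most naturally as the discrimination, difficulty, and lower-asymptote parameters of a three-parameter logistic (3PL) item response model. The sketch below evaluates such a response curve for one row of Table D.8. It rests on two assumptions that this appendix does not state: that the model is the 3PL, and that the conventional scaling constant D = 1.7 applies. The function name `p_correct` is hypothetical and used only for illustration.

```python
import math


def p_correct(theta, a, b, c, D=1.7):
    """Probability of a correct response under an assumed 3PL model.

    theta: examinee proficiency on the provisional scale
    a, b, c: the A, B, and C values from one table row
    D: scaling constant (1.7 is an assumption, not taken from the table)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Item N252001 (block M2, item 25) from Table D.8 (continued).
a, b, c = 1.244, 2.670, 0.196

# At theta equal to b the logistic term is 1/2, so the probability
# is c + (1 - c) / 2 regardless of a and D.
p_at_b = p_correct(b, a, b, c)

# Probabilities for a few proficiency values around the item difficulty.
curve = [(theta, p_correct(theta, a, b, c)) for theta in (0.0, 1.0, 2.0, 3.0, 4.0)]
```

Rows with C = 0.000 reduce to a two-parameter curve under this reading, since the lower asymptote vanishes.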






Table D.9 



NAEP 1988 IRT Parameters, Mathematics Trend Items, Age 13 



Field    Block  Item     A     SE         B     SE        C

N281901  M1     15    0.925  (0.040)  -2.181  (0.105)  0.146
N254601  M1     16    1.092  (0.054)  -1.553  (0.089)  0.284
N276801  M1     17    0.433  (0.049)  -4.715  (0.542)  0.000
N276802  M1     18    0.493  (0.044)  -3.957  (0.359)  0.000
N276803  M1     19    0.435  (0.033)  -1.927  (0.148)  0.000
N277601  M1     20    0.856  (0.036)  -2.504  (0.113)  0.000
N277602  M1     21    0.624  (0.030)  -1.885  (0.095)  0.000
N277603  M1     22    0.617  (0.031)  -2.287  (0.117)  0.000
N267201  M1     23    0.776  (0.058)  -1.051  (0.087)  0.254
N286201  M1     24    0.891  (0.051)  -0.892  (0.061)  0.243
N250901  M1     25    0.423  (0.029)  -2.565  (0.176)  0.152
N250902  M1     26    1.020  (0.049)  -0.349  (0.031)  0.075
N250903  M1     27    0.820  (0.039)  -1.510  (0.078)  0.096
N262401  M1     28    0.854  (0.054)  -0.556  (0.048)  0.323
N274801  M1     29    0.629  (0.051)  -0.192  (0.036)  0.269
N265202  M1     30    0.843  (0.074)  -0.176  (0.041)  0.339
N266801  M1     31    0.559  (0.038)  -1.108  (0.080)  0.248
N252901  M1     32    1.249  (0.072)  -0.036  (0.033)  0.109
N262501  M1     33    0.360  (0.033)  -0.237  (0.034)  0.348
N262502  M1     34    1.216  (0.068)   1.974  (0.151)  0.379
N257601  M1     35    1.280  (0.055)  -0.538  (0.035)  0.000
N265201  M1     36    0.810  (0.062)  -1.548  (0.127)  0.339
N273901  M1     37    1.786  (0.111)   0.258  (0.047)  0.184
N258801  M1     38    1.273  (0.055)   1.124  (0.076)  0.397
N263101  M1     39    0.527  (0.027)  -0.291  (0.024)  0.000
N265901  M1     40    0.933  (0.060)   0.930  (0.079)  0.333
N252101  M1     41    0.933  (0.056)   0.623  (0.054)  0.240
N275001  M1     42    0.946  (0.040)   0.363  (0.027)  0.000
N260101  M1     43    1.299  (0.072)   0.415  (0.042)  0.16?
N269001  M1     44    1.012  (0.053)   0.382  (0.036)  0.152
N286301  M1     45    1.189  (0.050)   0.660  (0.046)  0.205
N254602  M1     46    0.744  (0.045)   1.413  (0.095)  0.235
N261001  M1     47    0.833  (0.049)   1.011  (0.070)  0.219
N286501  M1     48    1.256  (0.042)   1.161  (0.058)  0.141
N278904  M1     49    1.315  (0.057)   1.487  (0.097)  0.194
N255701  M1     50    1.317  (0.044)   1.268  (0.063)  0.139
N283101  M1     51    1.579  (0.049)   1.554  (0.080)  0.148
N277401  M2      8    0.778  (0.056)  -2.903  (0.220)  0.145
N277901  M2      9    0.591  (0.033)  -3.506  (0.199)  0.000
N277902  M2     10    0.688  (0.036)  -3.301  (0.178)  0.000
N277903  M2     11    0.573  (0.030)  -2.859  (0.154)  0.000
N263401  M2     12    0.675  (0.046)  -2.751  (0.196)  0.257
N263402  M2     13    0.635  (0.045)  -2.478  (0.181)  0.263
N250701  M2     14    0.688  (0.035)  -2.717  (0.143)  0.106

[The standard errors for C are not legible on this page of the source.]

(continued)






Table D.9 (continued) 



NAEP 1988 IRT Parameters, Mathematics Trend Items, Age 13 



Field    Block  Item     A     SE         B     SE        C     SE

N250702  M2     15    1.145  (0.051)  -0.797  (0.047)  0.102  (0.018)
N250703  M2     16    0.649  (0.031)  -2.110  (0.106)  0.110  (0.028)
N256101  M2     17    0.760  (0.033)  -1.056  (0.052)  0.000  (0.000)
N262201  M2     18    0.520  (0.037)  -1.789  (0.132)  0.361  (0.023)
N270301  M2     20    0.421  (0.031)  -1.596  (0.119)  0.126  (0.022)
N270302  M2     21    1.018  (0.047)   2.194  (0.118)  0.051  (0.005)
N253701  M2     22    0.361  (0.031)  -0.504  (0.050)  0.271  (0.016)
N286601  M2     23    1.698  (0.059)  -0.194  (0.029)  0.000  (0.000)
N286602  M2     24    1.363  (0.051)  -0.247  (0.027)  0.000  (0.000)
N286603  M2     25    1.494  (0.050)   0.405  (0.030)  0.000  (0.000)
N269101  M2     26    0.752  (0.048)  -0.384  (0.037)  0.213  (0.016)
N282201  M2     28    1.063  (0.058)   0.576  (0.051)  0.343  (0.011)
N278902  M2     29    0.720  (0.051)   1.338  (0.107)  0.216  (0.012)
N263501  M2     30    1.389  (0.092)   0.187  (0.036)  0.115  (0.012)
N258802  M2     31    1.619  (0.078)   0.484  (0.051)  0.254  (0.011)
N278901  M2     32    1.559  (0.086)   0.415  (0.051)  0.212  (0.013)
N264701  M2     33    1.175  (0.056)   0.867  (0.059)  0.206  (0.010)
N261501  M2     34    0.661  (0.056)  -0.545  (0.055)  0.141  (0.020)
N261801  M2     35    0.679  (0.053)   0.044  (0.033)  0.223  (0.017)
N261601  M2     36    0.344  (0.043)   1.903  (0.239)  0.155  (0.012)
N261301  M2     37    0.700  (0.048)   0.768  (0.062)  0.113  (0.012)
N261201  M2     38    0.525  (0.052)   1.619  (0.166)  0.219  (0.012)
N281401  M2     39    0.728  (0.050)   1.711  (0.127)  0.106  (0.009)
N252001  M2     40    1.423  (0.064)   0.832  (0.062)  0.179  (0.010)
N258803  M2     41    1.191  (0.044)   1.351  (0.068)  0.170  (0.007)
N278903  M2     42    1.338  (0.058)   1.066  (0.073)  0.169  (0.010)
N286502  M2     43    1.671  (0.054)   1.171  (0.068)  0.160  (0.008)
N275301  M3     25    0.372  (0.028)  -1.728  (0.132)  0.147  (0.022)
N282202  M3     26    0.936  (0.066)  -0.458  (0.045)  0.255  (0.017)
N266101  M3     27    0.849  (0.065)  -0.161  (0.033)  0.292  (0.014)
N254001  M3     28    1.161  (0.084)  -0.479  (0.047)  0.118  (0.017)
N269901  M3     29    0.664  (0.049)  -0.274  (0.035)  0.288  (0.015)
N256501  M3     30    0.866  (0.069)   0.581  (0.061)  0.318  (0.012)
N265902  M3     31    1.077  (0.073)   1.170  (0.103)  0.328  (0.011)
N256801  M3     32    1.051  (0.069)   0.841  (0.072)  0.312  (0.011)






Table D.10



NAEP 1988 Mathematics Trend Conditioning Variables, Age 9 



No.  Variable      Estimated    Description
                   Effect

 1   OVERALL       -0.279547    OVERALL CONSTANT '1' FOR EVERYONE
 2   GENDER2       [illegible]  GENDER (FEMALE)
 3   ETHNIC2       -0.706632    OBSERVED ETHNICITY (BLACK)
 4   ETHNIC3        0.209298    OBSERVED ETHNICITY (HISPANIC)
 5   ETHNIC4        0.762678    OBSERVED ETHNICITY (ASIAN)
 6   STOC3          0.186615    SIZE AND TYPE OF COMMUNITY (HIGH METRO)
 7   STOC1          0.087756    SIZE AND TYPE OF COMMUNITY (NOT HI&NOT LO)
 8   REGION2        0.007280    REGION (SOUTHEAST)
 9   REGION3        0.123942    REGION (CENTRAL)
10   REGION4       -0.035032    REGION (WEST)
11   PARED2         0.251057    PARENTS EDUCATION (HIGH SCHOOL GRAD)
12   PARED3         0.223869    PARENTS EDUCATION (POST HIGH SCHOOL)
13   PARED4         0.454556    PARENTS EDUCATION (COLLEGE GRAD)
14   PARED_         0.136615    PARENTS EDUCATION (MISSING, I DON'T KNOW)
15   <MODAL GRADE  -0.728308    MODAL GRADE (LESS THAN MODAL GRADE)
16   >MODAL GRADE   0.631198    MODAL GRADE (GREATER THAN MODAL GRADE)
17   ITEMS2         0.239816    ITEMS IN THE HOME (YES TO 3)
18   ITEMS3         0.367498    ITEMS IN THE HOME (YES TO ALL 4)
19   E2 X SEX       0.087308    ETHNICITY BY GENDER (BLACK, FEMALE)
20   E3 X SEX      -0.066049    ETHNICITY BY GENDER (HISPANIC, FEMALE)
21   E4 X SEX      -0.231095    ETHNICITY BY GENDER (ASIAN AMERICAN, FEMALE)
22   E2 X PE2      -0.063586    ETHNICITY BY PARENT'S ED (BLACK, HS GRAD)
23   E2 X PE3       0.375105    ETHNICITY BY PARENT'S ED (BLACK, POST HS)
24   E2 X PE4       0.039552    ETHNICITY BY PARENT'S ED (BLACK, COLLEGE)
25   E2 X PE_       0.191412    ETHNICITY BY PARENT'S ED (BLACK, UNKNOWN)
26   E3 X PE2      -0.354255    ETHNICITY BY PARENT'S ED (HISPANIC, HS GRAD)
27   E3 X PE3       0.237226    ETHNICITY BY PARENT'S ED (HISPANIC, POST HS)
28   E3 X PE4      -0.256883    ETHNICITY BY PARENT'S ED (HISPANIC, COLLEGE)
29   E3 X PE_      -0.246003    ETHNICITY BY PARENT'S ED (HISPANIC, UNKNOWN)
30   E4 X PE2      -1.034833    ETHNICITY BY PARENT'S ED (ASIAN AM, HS GRAD)
31   E4 X PE3      -0.690193    ETHNICITY BY PARENT'S ED (ASIAN AM, POST HS)
32   E4 X PE4      -0.786758    ETHNICITY BY PARENT'S ED (ASIAN AM, COLLEGE)
33   E4 X PE_      -0.518339    ETHNICITY BY PARENT'S ED (ASIAN AM, UNKNOWN)
34   SCH TYP2       0.158816    SCHOOL TYPE (NOT PUBLIC)
35   SCH TYP_                   SCHOOL TYPE (MISSING)
36   TV1            0.278883    0-2 HOURS OF TV WATCHING
37   TV2            0.434684    3-5 HOURS OF TV WATCHING
38   TV3            0.259356    6+ HOURS OF TV WATCHING
39   LANGHOM3      -0.283533    LANGUAGE IN HOME OTHER THAN ENGLISH? (ALWAYS)
40   LANGHOM2       .088718     LANGUAGE IN HOME OTHER THAN ENGLISH? (SOMETIMES)
41   E2 X LH1       0.143997    ETHNICITY BY LANGUAGE IN HOME (BLACK, OFTEN)
42   E2 X LH2       0.080093    ETHNICITY BY LANG IN HOME (BLACK, SOMETIMES)
43   E3 X LH1       0.390581    ETHNICITY BY LANGUAGE IN HOME (HISPANIC, OFTEN)
44   E3 X LH2      -0.117348    ETHNICITY BY LANG IN HOME (HISPANIC, SOMETIMES)

(continued)






Table D.10 (continued)



NAEP 1988 Mathematics Trend Conditioning Variables, Age 9 







No.  Variable   Estimated    Description
                Effect

45   E4 X LH1    0.411867    ETHNICITY BY LANGUAGE IN HOME (ASIAN AM, OFTEN)
46   E4 X LH2    0.238582    ETHNICITY BY LANG IN HOME (ASIAN AM, SOMETIMES)
47   TIME ASS                TIME OF ASSESSMENT (APPLICABLE FOR Y17, N/A Y19)
48   STUDYCMP   -0.057134    ARE YOU STUDYING COMPUTERS? B004501 (YES)
49   DRACE2     -0.069875    DERIVED RACE/ETHNICITY (BLACK)
50   DRACE3     -0.341651    DERIVED RACE/ETHNICITY (HISPANIC)
51   DRACE4      0.185246    DERIVED RACE/ETHNICITY (ASIAN AMERICAN)






Table D.11



NAEP 1988 Mathematics Trend Conditioning Variables, Age 13 



No.  Variable      Estimated    Description
                   Effect

 1   OVERALL       -1.504811    OVERALL CONSTANT '1' FOR EVERYONE
 2   GENDER2       -0.228401    GENDER (FEMALE)
 3   ETHNIC2       -0.242682    OBSERVED ETHNICITY (BLACK)
 4   ETHNIC3        0.086195    OBSERVED ETHNICITY (HISPANIC)
 5   ETHNIC4        0.378006    OBSERVED ETHNICITY (ASIAN)
 6   STOC3          0.534516    SIZE AND TYPE OF COMMUNITY (HIGH METRO)
 7   STOC1          0.298905    SIZE AND TYPE OF COMMUNITY (NOT HI&NOT LO)
 8   REGION2       -0.121025    REGION (SOUTHEAST)
 9   REGION3       -0.063070    REGION (CENTRAL)
10   REGION4       -0.107134    REGION (WEST)
11   PARED2         0.140058    PARENTS EDUCATION (HIGH SCHOOL GRAD)
12   PARED3         0.197777    PARENTS EDUCATION (POST HIGH SCHOOL)
13   PARED4         0.278975    PARENTS EDUCATION (COLLEGE GRAD)
14   PARED_         0.021061    PARENTS EDUCATION (MISSING, I DON'T KNOW)
15   <MODAL GRADE  -0.480949    MODAL GRADE (LESS THAN MODAL GRADE)
16   >MODAL GRADE   0.541153    MODAL GRADE (GREATER THAN MODAL GRADE)
17   ITEMS2         0.122176    ITEMS IN THE HOME (YES TO 3)
18   ITEMS3         0.177230    ITEMS IN THE HOME (YES TO ALL 4)
19   E2 X SEX       0.020985    ETHNICITY BY GENDER (BLACK, FEMALE)
20   E3 X SEX       0.099927    ETHNICITY BY GENDER (HISPANIC, FEMALE)
21   E4 X SEX      -0.096259    ETHNICITY BY GENDER (ASIAN AMERICAN, FEMALE)
22   E2 X PE2      -0.181870    ETHNICITY BY PARENT'S ED (BLACK, HS GRAD)
23   E2 X PE3      -0.179468    ETHNICITY BY PARENT'S ED (BLACK, POST HS)
24   E2 X PE4      -0.397062    ETHNICITY BY PARENT'S ED (BLACK, COLLEGE)
25   E2 X PE_       0.090978    ETHNICITY BY PARENT'S ED (BLACK, UNKNOWN)
26   E3 X PE2      -0.033586    ETHNICITY BY PARENT'S ED (HISPANIC, HS GRAD)
27   E3 X PE3      -0.035114    ETHNICITY BY PARENT'S ED (HISPANIC, POST HS)
28   E3 X PE4      -0.359408    ETHNICITY BY PARENT'S ED (HISPANIC, COLLEGE)
29   E3 X PE_      -0.150307    ETHNICITY BY PARENT'S ED (HISPANIC, UNKNOWN)
30   E4 X PE2      -0.412270    ETHNICITY BY PARENT'S ED (ASIAN AM, HS GRAD)
31   E4 X PE3      -1.023135    ETHNICITY BY PARENT'S ED (ASIAN AM, POST HS)
32   E4 X PE4       0.005724    ETHNICITY BY PARENT'S ED (ASIAN AM, COLLEGE)
33   E4 X PE_      -0.148864    ETHNICITY BY PARENT'S ED (ASIAN AM, UNKNOWN)
34   SCH TYP2       0.019369    SCHOOL TYPE (NOT PUBLIC)
35   SCH TYP_                   SCHOOL TYPE (MISSING)
36   TV1           -0.192841    0-2 HOURS OF TV WATCHING
37   TV2           -0.259867    3-5 HOURS OF TV WATCHING
38   TV3           -0.391540    6+ HOURS OF TV WATCHING
39   HW-NO          0.143508    HOMEWORK (NONE ASSIGNED)
40   HW-YES         0.295564    HOMEWORK (YES - SOME AMOUNT)
41   HW-345        -0.046762    HOMEWORK (LINEAR AMOUNT)
42   LANGHOM3      -0.142210    LANGUAGE IN HOME OTHER THAN ENGLISH? (ALWAYS)
43   LANGHOM2       0.050961    LANGUAGE IN HOME OTHER THAN ENGLISH (SOMETIMES)

(continued)






Table D.11 (continued)



NAEP 1988 Mathematics Trend Conditioning Variables, Age 13 







No.  Variable   Estimated         Description
                Effect

44   E2 X LH1    0.1005?9         ETHNICITY BY LANGUAGE IN HOME (BLACK, OFTEN)
45   E2 X LH2    0.0519[illegible]  ETHNICITY BY LANG IN HOME (HISP., SOMETIMES)
46   E3 X LH1    0.032823         ETHNICITY BY LANGUAGE IN HOME (HISP., OFTEN)
47   E3 X LH2    [illegible]      ETHNICITY BY LANG IN HOME (HISP., SOMETIMES)
48   E4 X LH1    [illegible]      ETHNICITY BY LANGUAGE IN HOME (ASIAN AM, OFTEN)
49   E4 X LH2    [illegible]      ETHNICITY BY LANG IN HOME (ASIAN AM, SOMETIMES)
50   GRADES      0.329379         GRADES IN SCHOOL
51   TYPEMAT2    0.557133         TYPE OF MATH CLASS (REGULAR MATH)
52   TYPEMAT3    0.860079         TYPE OF MATH CLASS (PRE-ALGEBRA)
53   TYPEMAT4    1.067878         TYPE OF MATH CLASS (ALGEBRA, OTHER)
54   TIME ASS                     TIME OF ASSESSMENT (APPLICABLE Y17, N/A Y19)
55   STUDYCMP    0.000685         ARE YOU STUDYING COMPUTERS? B004501 (YES)
56   DRACE2      0.021696         DERIVED RACE/ETHNICITY (BLACK)
57   DRACE3     -0.262241         DERIVED RACE/ETHNICITY (HISPANIC)
58   DRACE4      0.239560         DERIVED RACE/ETHNICITY (ASIAN AMERICAN)






Table D.12 



NAEP 1988 IRT Parameters, Mathematics Trend Items, Age 17 



Field    Block  Item     A     SE         B     SE        C     SE

N256101  M1     15    1.003  (0.029)  -1.407  (0.051)  0.000  (0.000)
N260601  M1     16    1.499  (0.035)  -1.136  (0.043)  0.000  (0.000)
N262401  M1     17    0.920  (0.040)  -1.326  (0.068)  0.255  (0.025)
N258804  M1     18    0.682  (0.037)  -1.852  (0.105)  0.254  (0.029)
N286001  M1     19    0.766  (0.035)  -0.944  (0.051)  0.169  (0.020)
N286002  M1     20    0.855  (0.032)  -1.658  (0.071)  0.121  (0.027)
N286302  M1     22    1.088  (0.056)  -0.439  (0.044)  0.289  (0.018)
N278501  M1     23    1.030  (0.033)  -0.759  (0.035)  0.000  (0.000)
N278502  M1     24    0.895  (0.032)  -0.559  (0.030)  0.000  (0.000)
N278503  M1     25    0.900  (0.030)  -0.831  (0.037)  0.000  (0.000)
N258802  M1     26    1.728  (0.089)  -0.175  (0.042)  0.256  (0.014)
N254602  M1     27    1.575  (0.070)  -0.02?  (0.036)  0.211  (0.012)
N259901  M1     28    1.235  (0.066)  -0.225  (0.037)  0.289  (0.014)
N287101  M1     29    1.358  (0.060)  -0.382  (0.037)  0.202  (0.014)
N270301  M1     30    0.942  (0.036)  -1.403  (0.063)  0.140  (0.026)
N270302  M1     31    1.586  (0.059)   0.119  (0.031)  0.067  (0.009)
N255701  M1     32    1.451  (0.061)  -0.609  (0.045)  0.201  (0.018)
N254301  M1     33    1.035  (0.051)   0.084  (0.033)  0.258  (0.013)
N286502  M1     34    1.797  (0.097)  -0.123  (0.038)  0.181  (0.013)
N260901  M1     35    2.210  (0.113)   0.086  (0.045)  0.157  (0.012)
N256801  M1     36    1.299  (0.062)  -0.268  (0.038)  0.265  (0.015)
N258803  M1     37    0.992  (0.045)   0.250  (0.033)  0.222  (0.011)
N262601  M1     38    0.756  (0.039)   0.432  (0.038)  0.233  (0.012)
N253901  M1     39    1.647  (0.083)   0.011  (0.041)  0.259  (0.013)
N253902  M1     40    0.930  (0.057)   1.032  (0.084)  0.479  (0.011)
N253903  M1     41    1.168  (0.048)   0.915  (0.060)  0.322  (0.011)
N253904  M1     42    1.576  (0.062)   0.700  (0.058)  0.359  (0.011)
N263001  M1     43    0.664  (0.026)   0.707  (0.035)  0.000  (0.000)
N278905  M1     44    1.178  (0.046)   1.053  (0.063)  0.283  (0.010)
N287301  M1     45    0.793  (0.030)   0.120  (0.022)  0.000  (0.000)
N287302  M1     46    0.994  (0.031)   1.226  (0.048)  0.000  (0.000)
N264301  M1     47    0.800  (0.028)   0.888  (0.040)  0.000  (0.000)
N282801  M1     48    1.806  (0.054)   1.310  (0.075)  0.206  (0.010)
N251101  M1     49    1.166  (0.035)   0.949  (0.041)  0.000  (0.000)
N254601  M2     15    1.300  (0.049)  -1.815  (0.089)  0.237  (0.037)
N262301  M2     17    0.517  (0.035)  -1.239  (0.089)  0.233  (0.023)
N263201  M2     18    0.973  (0.050)  -1.348  (0.080)  0.361  (0.026)
N263202  M2     19    0.659  (0.042)  -0.434  (0.041)  0.352  (0.016)
N260101  M2     20    1.460  (0.055)  -0.973  (0.054)  0.195  (0.023)
N254001  M2     21    0.923  (0.044)  -0.847  (0.050)  0.186  (0.020)
N269001  M2     22    0.938  (0.046)  -0.398  (0.034)  0.169  (0.016)
N278901  M2     23    1.129  (0.056)  -0.229  (0.034)  0.232  (0.015)
N261501  M2     24    0.775  (0.036)  -2.237  (0.113)  0.166  (0.034)
N261801  M2     25    0.589  (0.032)  -1.985  (0.114)  0.211  (0.029)

(continued)




Table D.12 (continued) 



NAEP 1988 IRT Parameters, Mathematics Trend Items, Age 17 



Field    Block  Item     A     SE         B     SE        C     SE


N261201 


M9 


26 


0 SI 0 








0 91 S 


\yJ • \J£M ) 


N261601 


M9 


97 


w • • ^ / 




0 TOR 




n 9^10 


(c\ nl9^ 


N261301 




28 




(0 031^ 


-1 900 


fO 074^ 




fn n99^ 


N281401 




29 






-0 9AS 




n 1 no 


fn nls^ 


N280401 




30 










0 nnn 


fO nnn^ 


N259001 




31 


1 188 


(0,045) 


-0 91 R 


'0,025) 


0 nnn 


fa nnn^ 


N287102 




32 


1 114 


(0,050) 


-0 556 




0 1 79 


fo nlR^ 


N286301 


M2 


33 


1.350 


(0,071) 


-0 450 




0 991 


^n ni7^ 


N286501 




34 


1 149 


(0 049^ 


-0 847 




0 149 


^n n9i ^ 


N262501 




35 


0 878 


(0,060) 


0 217 




0 477 


rn nl^^ 

^ V/ , ux^ ^ 


N262502 


M2 


35 


0.598 


(0,045) 


1 756 


0,141) 


0 36S 


c'n nin^ 

V/ , v XV/ / 


N263101 




37 


0 754 


(0,032) 


-0 569 




0 nnn 


nnn^ 


N258801 


M2 


38 


1 904 


(0 110^ 


0 916 


'0.048) 


n 9RA 


fn n19^ 


N264701 


M9 


39 


1.578 


(0.082) 


-0 033 


'0.038) 


0 91 6 


^n nis^ 


N261001 


M2 


40 


0.806 


(0.046) 


-0 734 


;0.052) 


0 916 


^n n99^ 


N251701  M2     41    0.892  (0.046)   0.005  (0.029)  0.147  [illegible]
N278902  M2     42    1.162  (0.065)   0.014  (0.036)  0.236  [illegible]
N260801  M2     43    1.301  (0.044)   0.388  (0.030)  0.000  (0.000)
N278903  M2     44    1.921  (0.092)   0.365  (0.051)  0.227  (0.013)
N255601  M2     45    1.248  (0.059)   1.576  (0.107)  0.332  (0.011)
N255301  M2     46    1.539  (0.052)   1.503  (0.086)  0.219  (0.009)
N268901  M2     47    1.691  (0.062)   0.639  (0.054)  0.184  (0.012)
N268801  M2     48    0.917  (0.039)   1.654  (0.085)  0.102  (0.009)
N255801  M2     49    0.679  (0.030)   1.668  (0.080)  0.000  (0.000)
N266501  M3     31    0.775  (0.060)  -0.326  (0.041)  0.244  (0.017)
N271301  M3     32    1.374  (0.120)   0.185  (0.048)  0.261  (0.014)
N255501  M3     33    0.808  (0.054)   0.668  (0.059)  0.232  (0.013)
N256001  M3     34    1.055  (0.068)   0.066  (0.027)  0.000  (0.000)
N257101  M3     35    0.579  (0.054)   1.853  (0.181)  0.254  (0.011)






Table D.13 



NAEP 1988 Mathematics Trend Conditioning Variables, Age 17 



No.  Variable      Estimated    Description
                   Effect

 1   OVERALL        0.466202    OVERALL CONSTANT '1' FOR EVERYONE
 2   GENDER2       -0.227644    GENDER (FEMALE)
 3   ETHNIC2       -0.326424    OBSERVED ETHNICITY (BLACK)
 4   ETHNIC3       -0.125207    OBSERVED ETHNICITY (HISPANIC)
 5   ETHNIC4       -0.542147    OBSERVED ETHNICITY (ASIAN)
 6   STOC3          0.355679    SIZE AND TYPE OF COMMUNITY (HIGH METRO)
 7   STOC1          0.268174    SIZE AND TYPE OF COMMUNITY (NOT HI&NOT LO)
 8   REGION2       -0.035567    REGION (SOUTHEAST)
 9   REGION3        0.092946    REGION (CENTRAL)
10   REGION4        0.041544    REGION (WEST)
11   PARED2        -0.009106    PARENTS EDUCATION (HIGH SCHOOL GRAD)
12   PARED3         0.276562    PARENTS EDUCATION (POST HIGH SCHOOL)
13   PARED4         0.215802    PARENTS EDUCATION (COLLEGE GRAD)
14   PARED_         0.039054    PARENTS EDUCATION (MISSING, I DON'T KNOW)
15   <MODAL GRADE  -0.212266    MODAL GRADE (LESS THAN MODAL GRADE)
16   >MODAL GRADE  -0.091063    MODAL GRADE (GREATER THAN MODAL GRADE)
17   ITEMS2         0.032057    ITEMS IN THE HOME (YES TO 3)
18   ITEMS3         0.089343    ITEMS IN THE HOME (YES TO ALL 4)
19   E2 X SEX       0.130167    ETHNICITY BY GENDER (BLACK, FEMALE)
20   E3 X SEX       0.294555    ETHNICITY BY GENDER (HISPANIC, FEMALE)
21   E4 X SEX      -0.190247    ETHNICITY BY GENDER (ASIAN AMERICAN, FEMALE)
22   E2 X PE2      -0.014269    ETHNICITY BY PARENT'S ED (BLACK, HS GRAD)
23   E2 X PE3      -0.186204    ETHNICITY BY PARENT'S ED (BLACK, POST HS)
24   E2 X PE4      -0.163440    ETHNICITY BY PARENT'S ED (BLACK, COLLEGE)
25   E2 X PE_      -0.256462    ETHNICITY BY PARENT'S ED (BLACK, UNKNOWN)
26   E3 X PE2       0.037801    ETHNICITY BY PARENT'S ED (HISPANIC, HS GRAD)
27   E3 X PE3      -0.197622    ETHNICITY BY PARENT'S ED (HISPANIC, POST HS)
28   E3 X PE4      -0.148578    ETHNICITY BY PARENT'S ED (HISPANIC, COLLEGE)
29   E3 X PE_       0.076608    ETHNICITY BY PARENT'S ED (HISPANIC, UNKNOWN)
30   E4 X PE2       1.148569    ETHNICITY BY PARENT'S ED (ASIAN AM, HS GRAD)
31   E4 X PE3       0.548141    ETHNICITY BY PARENT'S ED (ASIAN AM, POST HS)
32   E4 X PE4      -0.003476    ETHNICITY BY PARENT'S ED (ASIAN AM, COLLEGE)
33   E4 X PE_       0.555852    ETHNICITY BY PARENT'S ED (ASIAN AM, UNKNOWN)
34   SCH TYP2      -0.130104    SCHOOL TYPE (NOT PUBLIC)
35   SCH TYP_                   SCHOOL TYPE (MISSING)
36   TV1           -1.980878    0-2 HOURS OF TV WATCHING
37   TV2           -1.992986    3-5 HOURS OF TV WATCHING
38   TV3           -2.079726    6+ HOURS OF TV WATCHING
39   HW-NO         -0.243494    HOMEWORK (NONE ASSIGNED)
40   HW-YES         0.104266    HOMEWORK (YES - SOME AMOUNT)
41   HW-345        -0.024606    HOMEWORK (LINEAR AMOUNT)
42   LANGHOM3      -0.306630    LANGUAGE IN HOME OTHER THAN ENGLISH? (ALWAYS)
43   LANGHOM2      -0.027324    LANGUAGE IN HOME OTHER THAN ENGLISH (SOMETIMES)

(continued)






Table D.13 (continued) 



NAEP 1988 Mathematics Trend Conditioning Variables, Age 17







No.  Variable   Estimated    Description
                Effect

44   E2 X LH1    0.234334    ETHNICITY BY LANGUAGE IN HOME (BLACK, OFTEN)
45   E2 X LH2   -0.085786    ETHNICITY BY LANG IN HOME (HISP., SOMETIMES)
46   E3 X LH1    0.372056    ETHNICITY BY LANGUAGE IN HOME (HISP., OFTEN)
47   E3 X LH2    0.068137    ETHNICITY BY LANG IN HOME (HISP., SOMETIMES)
48   E4 X LH1    0.542742    ETHNICITY BY LANG IN HOME (ASIAN AM, OFTEN)
49   E4 X LH2    0.390736    ETHNICITY BY LANG IN HOME (ASIAN AM, SOMETIMES)
50   NMATH1     -0.221100    HIGHEST LEVEL MATH TAKEN (PRE-ALGEBRA)
51   NMATH2      0.252774    HIGHEST LEVEL MATH TAKEN (ALGEBRA)
52   NMATH3      0.354687    HIGHEST LEVEL MATH TAKEN (GEOMETRY)
53   NMATH4      0.700470    HIGHEST LEVEL MATH TAKEN (ALGEBRA-2)
54   NMATH5      1.208891    HIGHEST LEVEL MATH TAKEN (CALCULUS)
55   COMPUTER   -0.009892    COMPUTER CLASS TAKEN? (YES)
56   GRADES      0.293596    GRADES IN SCHOOL
57   HSPROG2     0.196396    HIGH SCHOOL PROGRAM (COLLEGE PREP)
58   HSPROG3    -0.090029    HIGH SCHOOL PROGRAM (VOC/TECH)
59   DRACE2      0.119675    DERIVED RACE/ETHNICITY (BLACK)
60   DRACE3     -0.202548    DERIVED RACE/ETHNICITY (HISPANIC)
61   DRACE4     -0.056777    DERIVED RACE/ETHNICITY (ASIAN AMERICAN)
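
The estimated effects in the conditioning-variable tables can be read as coefficients of a linear model for mean proficiency given background variables. The sketch below sums the effects that apply to one hypothetical student profile, using three values from Table D.13. It assumes simple 0/1 indicator coding and a purely additive model; this appendix does not state the coding actually used, and the names `effects`, `profile`, and `conditional_mean` are illustrative only.

```python
# Three effects copied from Table D.13 (age 17); the other rows are omitted.
effects = {
    "OVERALL": 0.466202,   # overall constant, '1' for everyone
    "GENDER2": -0.227644,  # gender (female)
    "NMATH5": 1.208891,    # highest level math taken (calculus)
}


def conditional_mean(effects, profile):
    """Sum effect * indicator over the variables in the profile.

    Assumes 0/1 indicator coding, which is an assumption of this sketch.
    """
    return sum(effects[name] * value for name, value in profile.items())


# Hypothetical profile: a female student whose highest math course is calculus.
profile = {"OVERALL": 1, "GENDER2": 1, "NMATH5": 1}
mean = conditional_mean(effects, profile)
```

The result is on the provisional proficiency scale of the tables, not the reported NAEP scale, and a full prediction would include every variable that applies to the student.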






Table D.14 



NAEP 1988 IRT Parameters, Science Trend Items, Age 9 



Field Block Item 



SE 



B 



SE 



SE 





cl 
d1 


11 

o 




cl 

D JL 


Q 
O 






0 




Cl 


1 o 




Cl 


11 


VT/, An/i AA 


Cl 


1 O 

L/, 


vr/, nn A n ^ 


Cl 


1 7 
LJ 




Cl 


1 A 
Itf 


MA f\rs 1 m 


cl 


13 


o/,nm no 


cl 


lo 




cl 


1/ 




cl 


lo 




cl 


1 Q 
ly 


MAm nm 


cl 


90 


KJAm 1 m 

riH-UJLJLUJL 


cl 

D JL 


91 
ill 


MAm 9m 


Cl 

D JL 


99 




Cl 

D JL 


97 




C9 


1 
1 




<;9 


9 






A 








MAm ftm 


<:9 


o 


MAm 009 




7 




<>9 


Q 
O 






Q 


MAOl Q01 




"lO 


MA09001 




1 1 

JL JL 


MAO 9 009 


<Z9 


1 9 

JLiC 


KrA0900*> 


C9 


13 


MA091 01 
JNHUil JLUJL 


C9 


10 


MA09901 


C9 


1 7 
1/ 


N402401 


S2 


18 


N402501 


S2 


19 


N402602 


S2 


21 


N402701 


S2 


23 


N402801 


S2 


24 


N402901 


S2 


25 


N403001 


S3 


12 


N403101 


S3 


13 


N403201 


S3 


14 


N403202 


S3 


15 


N403301 


S3 


16 


N403401 


S3 


17 


N403501 


S3 


18 



0.650 
0.993 
1.246 
1.829 
0.566 
1.164 
1.012 
0.545 
0.294 
0.455 
0.647 
0.741 
0.333 
0.542 
0.292 
0.851 
0.504 
0.260 
0.599 
0.304 
0.299 
0.686
0.570 
0.455 
0.346 
0.469 
0.935 
1.224 
0.712 
0.562 
0.231 
0.253 
0.622 
0.401 
0.453 
1.084 
0.373 
0.422 
0.638 
0.404 
0.291 
0.62A 
0.234 
0.563 



(0.056
[0.113 
[0.092 
[0.126 
[0.063 
[0.098 
[0.095 
[0.063 
[0.069 
[0.076 
[0.062 
[0.066 
[0.049 
[0.053 
(0.048 
(0.080 
(0.060 
(0.047 
[0.058 
(0.059 
[0.059 
[0.109 
(0.082 
[0.075 
[0.068 
[0.072 
(0.091 
[0.106 
(0.103 
(0.061 
(0.039 
(0.051 
(0.090 
(0.063 
(0.058 
(0.083 
(0.094 
(0.062 
(0.062 
(0.048 
(0.038 
(0.056 
(0.047 
(0.067 



-1.173( 
-0.130( 
-1.214( 
-0.733(
-1.941( 
-0.651( 
-0.748( 
0.593< 
2.732< 
1.909< 
-0.202( 
0.070< 
1.804< 
0.729< 
1.737< 
2.036< 
1.478< 
0.249< 
-1.492< 
0.556< 
1.035< 
-0.035< 
-0.962< 
-0.279< 
1.698< 
1.855< 
-1.045( 
-1.036( 
-0.510( 
-0.332( 
0.333( 
2.764( 
2.692( 
-0.686( 
1.980( 
2.031( 
4.734( 
-5.043( 



,422 
,042 
,195 
,079 
,435 



(continued) 



0.257 






0.109) 
0.055) 
0.117) 
0.089) 
0.223) 
0.078) 
0.090) 
0.083) 
0.643) 
0.329) 
0.044) 
0.040) 
0.268) 
0.082) 
0.288) 
0.215) 
0.183) 
0.060) 
0.150) 
0.118) 
0.209) 
0.057) 
0.147) 
0.068) 
0.338) 
0.291) 
0.118) 
0.115) 
0.091) 
0.051) 
0.067) 
0.561) 
0.407) 
0.117) 
0.261) 
0.189) 
1.194) 
0.745) 
0.342) 
0.368) 
:0.161) 
:0.105) 
:0.095) 
:0.057) 



0.237 
0.340 
0.417 
0.280 
0.422 
0.322 
0.390 
0.330 
0.460 
0.424 
0.225 
0.202 
0.253 
0.210 
0.275 
0.243 
0.259 
0.347 
0.207 
0.452 
0.443 
0.447 
0.432 
0.440 
0.424 
0.318 
0.381 
0.386 
0.411 
0.206 
0.245 
0.235 
0.258 
0.439 
0.199 
0.161 
0.185 
0.238 
0.232 
0.212 
0.238 
0.218 
0.331 
0.400 






Table D.14 (continued) 



NAEP 1988 IRT Parameters, Science Trend Items, Age 9



Field             Block  Item     A     SE         B     SE        C     SE

N403502           S3     19    0.551  (0.059)  -1.918  (0.211)  0.404  (0.034)
N403503           S3     20    0.412  (0.060)   0.152  (0.054)  0.409  (0.020)
N403601           S3     21    0.811  (0.069)   0.534  (0.065)  0.254  (0.016)
N403701           S3     22    3.290  (0.390)  -0.287  (0.108)  0.312  (0.021)
N403702           S3     23    3.150  (0.247)  -0.496  (0.118)  0.374  (0.023)
N403703           S3     24    2.076  (0.204)  -0.326  (0.077)  0.302  (0.021)
N403801           S3     25    0.359  (0.057)   1.082  (0.180)  0.428  (0.017)
N403803           S3     27    0.497  (0.056)  -0.991  (0.119)  0.393  (0.026)
N403804           S3     28    0.484  (0.063)  -0.506  (0.080)  0.408  (0.023)
N403901           S3     29    0.653  (0.056)  -0.309  (0.046)  0.193  (0.023)
N404001           S3     30    0.203  (0.036)   1.764  (0.317)  0.223  (0.016)
N404201           S3     31    0.425  (0.050)   1.363  (0.165)  0.216  (0.015)
8105013-A001/001               0.504  (0.068)  -0.150  (0.105)  0.220  (0.056)
8202072-A001/001               0.606  (0.175)   2.669  (0.840)  0.233  (0.026)
8204035-A001/001               0.547  (0.062)  -0.698  (0.112)  0.187  (0.051)
8204085-A001/001               0.412  (0.056)  -0.292  (0.100)  0.189  (0.055)
8303086-A001/001               0.308  (0.060)  -1.816  (0.376)  0.483  (0.071)
8C17C04-A001/001               0.686  (0.118)   1.246  (0.276)  0.196  (0.034)
8C21C08-A001/001               0.894  (0.112)   0.963  (0.182)  0.131  (0.025)
8C23C11-A001/001               0.881  (0.084)  -1.138  (0.135)  0.180  (0.048)
8C24C07-A001/001               0.512  (0.075)   0.138  (0.117)  0.230  (0.055)
8G52C03-A001/001               0.369  (0.105)   3.181  (0.934)  0.188  (0.036)
8G52C04-A001/001               1.116  (0.119)   0.051  (0.095)  0.201  (0.035)
8G54C10-A001/001               0.522  (0.097)   1.528  (0.325)  0.163  (0.037)
8G55C03-A001/001               0.398  (0.079)   1.170  (0.281)  0.232  (0.053)
8G56G02-A001/001               0.704  (0.094)   0.766  (0.162)  0.167  (0.036)
8G58G10-A001/001               0.665  (0.098)   1.036  (0.205)  0.162  (0.033)
8G61G09-A001/001               0.503  (0.065)   0.221  (0.098)  0.173  (0.046)
8G63G13-A001/001               0.370  (0.061)  -0.388  (0.135)  0.327  (0.064)
8G71C09-A001/001               0.578  (0.066)  -0.546  (0.105)  0.194  (0.051)
8G71G12-A001/001               0.947  (0.103)  -2.060  (0.261)  0.198  (0.055)
8G71G13-A001/001               0.622  (0.095)  -2.270  (0.372)  0.465  (0.071)
8G71G13-A002/002               0.605  (0.091)  -2.207  (0.358)  0.459  (0.070)
8G71G13-A003/003               0.546  (0.168)   1.822  (0.655)  0.546  (0.037)
8G71G13-A004/004               0.498  (0.075)  -0.987  (0.192)  0.432  (0.065)
8G82G08-A001/001               0.522  (0.081)   0.754  (0.172)  0.184  (0.045)



225 

229 



Table D.15

NAEP 1988 IRT Parameters, Science Trend Items, Age 13

Field     Block  Item    A      SE             B      SE        C      SE

N404501   S1     12    1.153  (0.055)       -2.021  (0.119)   0.164  (0.042)
N404601   S1     13    0.318  (0.039)       -0.641  (0.084)   0.228  (0.021)
N404701   S1     14    0.601  (0.043)       -1.538  (0.117)   0.194  (0.029)
N404702   S1     15    [illegible]  [illegible]  -0.140  (0.033)   0.201  (0.018)
N400201   S1     16    0.464  (0.041)       -1.666  (0.151)   0.206  (0.029)
N404901   S1     17    0.691  (0.051)       -0.629  (0.057)   0.209  (0.022)
N404801   S1     20    1.372  (0.085)       -1.624  (0.136)   0.422  (0.043)
N404802   S1     21    1.610  [illegible]   -0.514  (0.077)   0.360  (0.022)
N404803   S1     22    0.956  (0.078)        0.240  (0.049)   0.321  (0.016)
N405001   S1     23    0.349  [illegible]    0.200  (0.037)   0.214  (0.017)
N405101   S1     24    0.794  (0.052)        0.968  (0.077)   0.199  (0.012)
N405201   S1     25    0.315  (0.036)       -0.124  (0.033)   0.182  (0.019)
N405301   S1     26    0.623  (0.049)        1.251  (0.107)   0.199  (0.012)
N405401   S1     27    0.801  (0.053)        1.138  (0.087)   0.181  (0.011)
N401201   S1     28    0.544  (0.049)        0.415  (0.051)   0.249  (0.016)
N405501   S1     29    0.628  (0.052)       -0.031  (0.035)   0.197  (0.019)
N405601   S1     30    0.233  (0.034)        1.041  (0.153)   0.198  (0.016)
N405701   S1     31    1.012  (0.067)        0.715  (0.065)   0.185  (0.013)
N405801   S1     32    0.493  (0.044)        1.324  (0.124)   0.166  (0.012)
N405901   S1     33    0.637  (0.049)        1.658  (0.137)   0.158  (0.011)
N406001   S1     34    0.455  (0.107)        4.846  (1.148)   0.174  (0.008)
N406101   S1     35    0.531  (0.120)        4.384  (1.008)   0.207  (0.008)
N406201   S1     36    0.360  (0.089)        5.620  (1.399)   0.099  (0.007)
N406301   S2     10    0.356  (0.052)       -1.563  (0.231)   0.430  (0.026)
N406302   S2     11    0.386  (0.051)       -0.408  (0.069)   0.428  (0.021)
N406303   S2     12    0.606  (0.063)        1.470  (0.166)   0.392  (0.013)
N406304   S2     13    0.471  (0.066)        1.354  (0.200)   0.419  (0.015)
N406401   S2     14    0.504  (0.066)       -0.157  (0.050)   0.461  (0.020)
N406402   S2     15    0.861  (0.090)        0.303  (0.062)   0.405  (0.018)
N406403   S2     16    0.753  (0.074)       -1.328  (0.142)   0.419  (0.031)
N406404   S2     17    0.910  (0.111)       -0.305  (0.067)   0.457  (0.022)
N406405   S2     18    0.628  (0.066)       -0.528  (0.075)   0.402  (0.025)
N406501   S2     19    0.495  (0.043)        0.628  (0.064)   0.170  (0.016)
N406601   S2     20    0.491  (0.044)       -0.855  (0.082)   0.175  (0.023)
N406701   S2     21    0.576  (0.049)        0.093  (0.034)   0.240  (0.016)
N406801   S2     22    1.128  (0.074)       -1.417  (0.114)   0.396  (0.036)
N406802   S2     23    0.342  (0.047)        0.687  (0.104)   0.445  (0.015)
N406803   S2     24    0.816  (0.074)       -0.660  (0.074)   0.382  (0.022)
N406804   S2     25    1.057  (0.073)       -1.014  (0.086)   0.371  (0.027)
N406805   S2     26    1.037  (0.097)        1.523  (0.181)   0.550  (0.011)
N406806   S2     27    0.440  (0.053)        0.226  (0.050)   0.423  (0.017)
N406901   S2     28    0.613  (0.052)        0.019  (0.034)   0.231  (0.017)
N407001   S2     29    0.263  (0.035)        0.158  (0.038)   0.182  (0.019)
N407101   S2     30    0.817  (0.055)        2.218  (0.168)   0.126  (0.009)

(continued)



Table D.15 (continued)

NAEP 1988 IRT Parameters, Science Trend Items, Age 13

Field     Block  Item    A      SE         B            SE           C            SE

N407201   S2     31    0.470  (0.041)   [illegible]  [illegible]  0.207        [illegible]
N407301   S2     32    0.319  (0.039)    1.672       (0.208)      0.234        [illegible]
N407302   S2     33    0.346  (0.046)    1.817       (0.245)      0.270        [illegible]
N408001   S2     34    0.848  (0.050)    1.268       (0.087)      0.176        (0.010)
N407601   S2     35    0.453  (0.044)    1.743       (0.173)      0.180        (0.019)
N407701   S2     37    0.564  (0.044)    1.273       (0.107)      [illegible]  [illegible]
N407801   S2     38    0.666  (0.055)    2.158       (0.189)      [illegible]  [illegible]
N407901   S2     39    0.383  (0.037)    0.849       (0.089)      [illegible]  [illegible]
N408201   S2     40    0.567  (0.070)   [illegible]  [illegible]  [illegible]  [illegible]
N408301   S3     10    0.788  (0.061)   [illegible]  [illegible]  [illegible]  [illegible]
N408302   S3     11    0.708  (0.065)   [illegible]  [illegible]  [illegible]  [illegible]
N408303   S3     12    0.647  (0.060)   [illegible]  [illegible]  [illegible]  [illegible]
N408304   S3     13    0.971  (0.079)   -1.384       (0.129)      [illegible]  [illegible]
N408401   S3     14    0.240  (0.032)   -1.476       (0.199)      [illegible]  [illegible]
N408501   S3     15    0.733  (0.056)   -0.896       (0.077)      [illegible]  [illegible]
N408502   S3     16    0.390  (0.040)    1.337       (0.140)      0.154        (0.013)
N408601   S3     17    0.388  (0.035)   -1.071       (0.102)      0.153        (0.022)
N408701   S3     18    0.346  (0.038)   -0.101       (0.031)      0.212        (0.018)
N408801   S3     19    0.174  (0.030)    0.655       (0.117)      0.234        (0.017)
N408901   S3     20    0.743  (0.079)    0.274       (0.055)      0.445        (0.015)
N408902   S3     21    0.889  (0.069)   -1.740       (0.149)      0.410        (0.038)
N408903   S3     22    0.656  (0.066)    0.434       (0.062)      0.404        (0.015)
N408904   S3     23    0.540  (0.060)   [illegible]  [illegible]  0.411        (0.014)
N409001   S3     24    0.599  (0.045)   -0.364       (0.040)      0.163        (0.019)
N409101   S3     25    0.635  (0.045)   -1.494       (0.113)      0.239        (0.029)
N409102   S3     26    0.556  (0.047)    0.178       (0.036)      0.229        (0.016)
N409103   S3     27    0.518  (0.059)    2.017       (0.235)      0.306        (0.011)
N409201   S3     28    0.292  (0.039)    0.393       (0.061)      0.261        (0.017)
N409301   S3     29    0.706  (0.056)   -0.145       (0.035)      0.165        (0.018)
N409501   S3     33    0.607  (0.052)    2.148       (0.191)      0.134        (0.009)
N409601   S3     34    0.708  (0.061)    1.717       (0.162)      0.290        (0.011)
N409701   S3     35    0.633  (0.060)    2.485       (0.248)      0.165        (0.009)



Table D.16

NAEP 1988 IRT Parameters, Science Trend Items, Age 17

Field     Block  Item    A            SE             B      SE        C

N400201   S1     12    [illegible]  (0.116)       -1.669  (0.370)   0.196
N404601   S1     13    [illegible]  [illegible]   -0.565  (0.150)   0.197
N410003   S1     16    [illegible]  [illegible]   -1.988  (0.486)   0.400
N410004   S1     17    0.499        (0.199)       -1.225  (0.334)   0.401
N409901   S1     18    0.867        (0.168)       -0.931  (0.209)   0.191
N408601   S1     19    0.426        (0.091)       -1.329  (0.295)   0.164
N409301   S1     20    0.625        (0.199)       -1.324  (0.274)   0.149
N406301   S1     21    0.334        (0.091)       -1.322  (0.371)   0.410
N406302   S1     22    0.490        (0.105)       -0.246  (0.118)   0.401
N406303   S1     23    0.506        (0.199)        0.383  (0.147)   0.397
N406304   S1     24    0.511        [illegible]   -0.276  (0.132)   0.395
N410101   S1     25    0.696        [illegible]   -0.700  (0.207)   0.394
N410102   S1     26    [illegible]  [illegible]   -0.401  (0.144)   0.404
N410103   S1     27    0.566        [illegible]   -1.408  (0.365)   0.396
N406601   S1     28    0.547        [illegible]   -0.915  (0.210)   0.151
N405001   S1     29    0.462        (0.098)       -0.305  (0.102)   0.198
N401201   S1     30    0.613        (0.124)       -0.226  (0.097)   0.229
N405201   S1     31    0.444        (0.095)       -0.703  (0.168)   0.152
N410201   S1     32    0.491        (0.119)        1.890  (0.476)   0.199
N406001   S1     33    0.471        (0.115)        2.129  (0.536)   0.197
N409301   S1     34    0.714        (0.129)        1.100  (0.225)   0.133
N406101   S1     35    0.494        (0.142)        2.885  (0.854)   0.214
N406201   S1     37    0.658        (0.130)        2.184  (0.457)   0.116
N408101   S1     38    0.625        (0.124)        1.626  (0.344)   0.142
N406401   S2     10    0.632        (0.158)       -0.678  (0.205)   0.395
N406402   S2     11    0.676        (0.189)       -0.075  (0.124)   0.391
N406403   S2     12    0.815        (0.189)       -1.522  (0.388)   0.395
N406404   S2     13    0.721        (0.182)       -1.204  (0.333)   0.393
N406405   S2     14    0.637        (0.171)       -0.963  (0.292)   0.397
N410401   S2     15    0.396        (0.093)        0.086  (0.088)   0.244
N406801   S2     16    0.672        (0.157)       -1.921  (0.471)   0.396
N406802   S2     17    0.452        (0.121)        1.281  (0.367)   0.403
N406803   S2     18    0.575        (0.136)       -1.248  (0.314)   0.396
N406804   S2     19    0.709        (0.147)       -1.539  (0.344)   0.391
N406805   S2     20    0.458        (0.117)        0.473  (0.164)   0.408
N406806   S2     21    0.396        (0.105)        0.270  (0.128)   0.406
N410501   S2     22    0.415        (0.088)       -0.420  (0.118)   0.150
N410601   S2     23    1.057        (0.208)        2.077  (0.507)   0.229
N410602   S2     24    0.430        (0.122)       -2.476  (0.714)   0.405
N410603   S2     25    0.768        (0.170)        1.333  (0.336)   0.338
N410604   S2     26    0.414        (0.110)       -2.139  (0.577)   0.405
N406901   S2     27    0.500        (0.109)       -0.531  (0.142)   0.196
N407401   S2     28    0.388        (0.052)       -0.059  (0.042)   0.486
N407403   S2     30    0.581        (0.151)       -0.258  (0.137)   0.393

(continued)



Table D.16 (continued)

NAEP 1988 IRT Parameters, Science Trend Items, Age 17

Field        Block  Item    A            SE            B            SE           C      SE

N407404      S2     31    0.714        (0.166)      -1.370       (0.349)      0.395  (0.057)
N407201      S2     32    0.500        (0.106)       0.120       (0.084)      0.153  (0.035)
N407001      S2     33    0.333        (0.079)      -0.920       (0.232)      0.155  (0.042)
N410701      S2     34    0.542        (0.120)       0.833       (0.209)      0.201  (0.033)
N407701      S2     35    0.450        (0.097)       0.898       (0.214)      0.152  (0.032)
N407301      S2     36    0.346        (0.083)       0.510       (0.147)      0.204  (0.036)
N407302      S2     37    0.445        (0.110)       0.917       (0.249)      0.246  (0.035)
N407101      S2     38    0.614        (0.126)       1.878       (0.410)      0.150  (0.026)
N410801      S2     39    0.542        (0.124)       1.554       (0.376)      0.193  (0.030)
N410901      S2     40    0.707        (0.134)       1.777       (0.367)      0.155  (0.024)
N411001      S2     41    0.545        (0.145)       2.730       (0.751)      0.193  (0.024)
N408301      S3     10    0.834        (0.186)      -0.241       (0.125)      0.381  (0.041)
N408302      S3     11    0.457        (0.119)      -1.685       (0.456)      0.401  (0.056)
N408303      S3     12    0.543        (0.122)      -2.012       (0.470)      0.398  (0.057)
N408304      S3     13    0.640        (0.162)      [illegible]  (0.426)      0.396  (0.058)
[illegible]  S3     14    0.595        (0.111)       0.272       (0.103)      0.235  (0.035)
N408901      S3     15    0.769        (0.168)      -1.203       (0.292)      0.393  (0.053)
N408902      S3     16    0.836        (0.165)      -1.922       (0.422)      0.395  (0.062)
N408903      S3     17    0.363        (0.127)      -0.172       (0.112)      0.392  (0.041)
N408904      S3     18    0.586        (0.135)      -0.374       (0.135)      0.398  (0.043)
N405401      S3     19    0.619        (0.104)       0.631       (0.138)      0.145  (0.031)
[illegible]  S3     20    [illegible]  [illegible]  [illegible]  [illegible]  0.120  (0.023)
N405501      S3     21    0.584        (0.121)      -0.295       (0.108)      0.196  (0.041)
N411101      S3     22    0.507        (0.096)       0.255       (0.093)      0.150  (0.035)
N411201      S3     23    0.566        (0.105)       0.490       (0.126)      0.195  (0.033)
N408801      S3     24    0.505        (0.101)      -0.340       (0.105)      0.198  (0.039)
N411401      S3     25    0.846        (0.151)       0.534       (0.137)      0.152  (0.030)
N411501      S3     26    0.860        (0.125)       1.749       (0.300)      0.179  (0.024)
N411502      S3     27    0.619        (0.131)      -1.037       (0.240)      0.237  (0.048)
N411601      S3     28    0.609        (0.108)       1.227       (0.244)      0.184  (0.030)
N411701      S3     29    0.745        (0.119)       1.395       (0.256)      0.169  (0.027)
N411801      S3     30    1.069        (0.175)       0.650       (0.161)      0.167  (0.031)
N411901      S3     31    0.762        (0.122)       1.429       (0.261)      0.142  (0.025)
N412001      S3     32    0.572        (0.119)       2.048       (0.453)      0.187  (0.029)
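The A, B, and C columns in Tables D.14 to D.16 are item response theory (IRT) item parameters. As a minimal sketch, assuming they are the discrimination, difficulty, and lower-asymptote parameters of a three-parameter logistic model with the conventional scaling constant 1.7 (an assumption; this appendix does not restate the model form), the probability of a correct response could be computed as follows. The function name `p_correct` is illustrative only.

```python
import math


def p_correct(theta, a, b, c, d=1.7):
    """Three-parameter logistic (3PL) item response function.

    theta -- examinee proficiency on the provisional scale
    a, b, c -- discrimination, difficulty, and lower asymptote of the item
    d -- scaling constant; 1.7 is assumed here, it is not taken from the tables
    """
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


# Item N407404 from Table D.16 (continued): A = 0.714, B = -1.370, C = 0.395
a, b, c = 0.714, -1.370, 0.395
p_at_b = p_correct(b, a, b, c)    # proficiency equal to the item difficulty
p_high = p_correct(2.0, a, b, c)  # a more proficient examinee
```

When the proficiency equals B, the logistic term is one half, so the probability is C + (1 - C)/2 regardless of A; the A parameter only controls how quickly the curve rises around that point.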



Table D.17

NAEP 1988 Science Trend Conditioning Variables, Age 9

                        Estimated
     Variable           Effect       Description

 1   OVERALL            -0.167629    OVERALL CONSTANT '1' FOR EVERYONE
 2   GENDER2            -0.160032    GENDER (FEMALE)
 3   ETHNIC2            -0.716027    OBSERVED ETHNICITY (BLACK)
 4   ETHNIC3            -0.677694    OBSERVED ETHNICITY (HISPANIC)
 5   ETHNIC4            -0.143962    OBSERVED ETHNICITY (ASIAN)
 6   STOC2              -0.400385    SIZE AND TYPE OF COMMUNITY (LOW METRO)
 7   STOC3               0.114765    SIZE AND TYPE OF COMMUNITY (HIGH METRO)
 8   REGION2             0.105314    REGION (SOUTHEAST)
 9   REGION3             0.202669    REGION (CENTRAL)
10   REGION4             0.081810    REGION (WEST)
11   PARED2              0.200699    PARENTS EDUCATION (HIGH SCHOOL GRAD)
12   PARED3              0.279235    PARENTS EDUCATION (POST HIGH SCHOOL)
13   PARED4              0.435635    PARENTS EDUCATION (COLLEGE GRAD)
14   PARED_              0.172272    PARENTS EDUCATION (MISSING, DON'T KNOW)
15   <MODAL GRADE       -0.498134    MODAL GRADE (LESS THAN MODAL GRADE)
16   >MODAL GRADE        1.050936    MODAL GRADE (GREATER THAN MODAL GRADE)
17   ITEMS2              0.289243    ITEMS IN THE HOME (YES TO 3)
18   ITEMS3              0.478227    ITEMS IN THE HOME (YES TO ALL 4)
19   SCH TYP2            0.076284    SCHOOL TYPE (NOT PUBLIC)
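Table D.17 lists one estimated effect per conditioning variable. As a minimal sketch, assuming the effects act as coefficients of a linear model on 0/1 indicator variables and that categories not listed in the table are reference levels (both are assumptions; the table itself gives only the estimates), a predicted mean for one background profile could be assembled as below. The names `EFFECTS_AGE9` and `predicted_mean` are illustrative, not taken from the report.

```python
# Estimated effects transcribed from Table D.17 (science trend, age 9).
EFFECTS_AGE9 = {
    "OVERALL": -0.167629,
    "GENDER2": -0.160032,
    "ETHNIC2": -0.716027,
    "ETHNIC3": -0.677694,
    "ETHNIC4": -0.143962,
    "STOC2": -0.400385,
    "STOC3": 0.114765,
    "REGION2": 0.105314,
    "REGION3": 0.202669,
    "REGION4": 0.081810,
    "PARED2": 0.200699,
    "PARED3": 0.279235,
    "PARED4": 0.435635,
    "PARED_": 0.172272,
    "<MODAL GRADE": -0.498134,
    ">MODAL GRADE": 1.050936,
    "ITEMS2": 0.289243,
    "ITEMS3": 0.478227,
    "SCH TYP2": 0.076284,
}


def predicted_mean(active, effects=EFFECTS_AGE9):
    """Add the overall constant to the effects of every indicator equal to 1."""
    return effects["OVERALL"] + sum(effects[name] for name in active)


# Hypothetical profile: female, West region, college-graduate parents,
# all four items in the home; every other indicator is 0.
profile = ["GENDER2", "REGION4", "PARED4", "ITEMS3"]
value = predicted_mean(profile)
```

For this profile the sum is -0.167629 - 0.160032 + 0.081810 + 0.435635 + 0.478227, which is about 0.668 on the provisional scale.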



Table D.18

NAEP 1988 Science Trend Conditioning Variables, Age 13

                        Estimated
     Variable           Effect       Description

 1   OVERALL            -0.048884    OVERALL CONSTANT '1' FOR EVERYONE
 2   GENDER2            -0.267412    GENDER (FEMALE)
 3   ETHNIC2            -0.719052    OBSERVED ETHNICITY (BLACK)
 4   ETHNIC3            -0.524609    OBSERVED ETHNICITY (HISPANIC)
 5   ETHNIC4             0.161636    OBSERVED ETHNICITY (ASIAN)
 6   STOC2              -0.395130    SIZE AND TYPE OF COMMUNITY (LOW METRO)
 7   STOC3              -0.007911    SIZE AND TYPE OF COMMUNITY (HIGH METRO)
 8   REGION2            -0.077003    REGION (SOUTHEAST)
 9   REGION3             0.046762    REGION (CENTRAL)
10   REGION4            -0.102571    REGION (WEST)
11   PARED2              0.107733    PARENTS EDUCATION (HIGH SCHOOL GRAD)
12   PARED3              0.357308    PARENTS EDUCATION (POST HIGH SCHOOL)
13   PARED4              0.412279    PARENTS EDUCATION (COLLEGE GRAD)
14   PARED_             -0.047971    PARENTS EDUCATION (MISSING, DON'T KNOW)
15   <MODAL GRADE       -0.530171    MODAL GRADE (LESS THAN MODAL GRADE)
16   >MODAL GRADE        0.969538    MODAL GRADE (GREATER THAN MODAL GRADE)
17   ITEMS2              0.222418    ITEMS IN THE HOME (YES TO 3)
18   ITEMS3              0.404732    ITEMS IN THE HOME (YES TO ALL 4)
19   SCH TYP2            0.128735    SCHOOL TYPE (NOT PUBLIC)



Table D.19

NAEP 1988 Science Trend Conditioning Variables, Age 17

                        Estimated
     Variable           Effect       Description

 1   OVERALL            -0.018353    OVERALL CONSTANT '1' FOR EVERYONE
 2   GENDER2            -0.422265    GENDER (FEMALE)
 3   ETHNIC2            -0.675393    OBSERVED ETHNICITY (BLACK)
 4   ETHNIC3            -0.028940    OBSERVED ETHNICITY (HISPANIC)
 5   ETHNIC4             0.105174    OBSERVED ETHNICITY (ASIAN)
 6   STOC2              -0.215624    SIZE AND TYPE OF COMMUNITY (LOW METRO)
 7   STOC3               0.200910    SIZE AND TYPE OF COMMUNITY (HIGH METRO)
 8   REGION2            -0.078230    REGION (SOUTHEAST)
 9   REGION3            -0.145136    REGION (CENTRAL)
10   REGION4            -0.156447    REGION (WEST)
11   PARED2              0.277744    PARENTS EDUCATION (HIGH SCHOOL GRAD)
12   PARED3              0.506923    PARENTS EDUCATION (POST HIGH SCHOOL)
13   PARED4              0.724225    PARENTS EDUCATION (COLLEGE GRAD)
14   PARED_             -0.353136    PARENTS EDUCATION (MISSING, DON'T KNOW)
15   <MODAL GRADE       -0.540566    MODAL GRADE (LESS THAN MODAL GRADE)
16   >MODAL GRADE        0.345666    MODAL GRADE (GREATER THAN MODAL GRADE)
17   ITEMS2              0.091730    ITEMS IN THE HOME (YES TO 3)
18   ITEMS3              0.208488    ITEMS IN THE HOME (YES TO ALL 4)
19   SCH TYP2           -0.094395    SCHOOL TYPE (NOT PUBLIC)



APPENDIX E

Estimation of the Standard Errors
of the Adjusted 1986 NAEP Results



Eugene G. Johnson
Robert J. Mislevy
Rebecca Zwick



Common-population linear equating of the results from the 1988 bridges
was used to link the 1986 results to the 1984 reading scale. The procedures
described below were carried out for each age cohort independently. Let
$\hat{\mu}_1$ and $\hat{\sigma}_1$ be, respectively, the estimated mean and
standard deviation of the proficiency scores from the 1988 bridge to 1986,
these values being in the 86-P provisional metric¹ (see Chapter 6). Let
$\hat{\mu}_2$ and $\hat{\sigma}_2$ be the estimated mean and standard
deviation of the proficiency scores from the 1988 bridge to 1984, these
values being on the same metric as the 1984 reading scale². The
common-population linear equating of the two sets of bridge values comes
about by matching the estimated moments for the two bridges, producing the
following equating function for going from the 1986 (86-P) metric to the
1984 metric:

$$f(\theta, \hat{A}, \hat{B}) = \hat{A}\theta + \hat{B} \qquad (1)$$

where

$\theta$ is a proficiency value in the 1986 metric,

$\hat{A} = \hat{\sigma}_2 / \hat{\sigma}_1$, and

$\hat{B} = \hat{\mu}_2 - \hat{A}\hat{\mu}_1$.

¹ that is, with reference to the item parameters estimated from the 1986
data only.

² that is, with reference to item parameters estimated from the 1984 data
only.



Equation (1) is used to produce adjusted proficiency values for the 1986
assessment. In particular, let $\bar{X}$ be the estimated 1986 mean
proficiency for some subgroup of the population (or for the population as a
whole), this estimate being based on the proficiency values in the
provisional 1986 metric. The adjusted estimate of the 1986 mean proficiency
of the subgroup, in the metric of the 1984 reading scale, is

$$\bar{X}_{adj} = f(\bar{X}, \hat{A}, \hat{B}) = \hat{A}\bar{X} + \hat{B}. \qquad (2)$$
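The equating in equations (1) and (2) can be sketched in a few lines. This is an illustration only: the bridge moments below are hypothetical numbers chosen for the example, not values from the report, and the function names are invented here.

```python
def equating_coefficients(mu1, sigma1, mu2, sigma2):
    """Slope A and intercept B that match the first two moments of the bridges.

    mu1, sigma1 -- mean and SD of the 1988 bridge to 1986 (86-P metric)
    mu2, sigma2 -- mean and SD of the 1988 bridge to 1984 (1984 metric)
    """
    a = sigma2 / sigma1
    b = mu2 - a * mu1
    return a, b


def equate(theta, a, b):
    """Equation (1): map a value from the 1986 metric to the 1984 metric."""
    return a * theta + b


# Hypothetical bridge moments, for illustration only.
mu1, sigma1 = 0.10, 0.95
mu2, sigma2 = 250.0, 40.0
a, b = equating_coefficients(mu1, sigma1, mu2, sigma2)

# Equation (2): adjusted mean for a hypothetical 1986 subgroup mean of 0.25.
x_adj = equate(0.25, a, b)
```

By construction the transformation sends the mean of the first bridge to the mean of the second and a point one standard deviation above the first mean to one standard deviation above the second, which is what matching the two moments means.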

If $\hat{\mu}_1$, $\hat{\sigma}_1$, $\hat{\mu}_2$, $\hat{\sigma}_2$ and,
consequently, $\hat{A}$ and $\hat{B}$ were known without error, the variance
of $\bar{X}_{adj}$ would be simply

$$\hat{A}^2 \, Var(\bar{X}). \qquad (3)$$

However, since $\hat{A}$ and $\hat{B}$ are based on estimates from samples of
the 1988 population, they are subject to variability. Ignoring this
variability by using (3) as the estimate of the variance of the adjusted
subgroup mean will result in an underestimate of the true variability of
$\bar{X}_{adj}$.

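A large sample approximation to the variance of $\bar{X}_{adj}$ is

$$\left[ \frac{\partial f}{\partial \bar{X}} \;\; \frac{\partial f}{\partial \hat{A}} \;\; \frac{\partial f}{\partial \hat{B}} \right] \Sigma \left[ \frac{\partial f}{\partial \bar{X}} \;\; \frac{\partial f}{\partial \hat{A}} \;\; \frac{\partial f}{\partial \hat{B}} \right]' = [\, \hat{A} \;\; \bar{X} \;\; 1 \,] \; \Sigma \; [\, \hat{A} \;\; \bar{X} \;\; 1 \,]'$$

where

$$\Sigma = Var([\, \bar{X} \;\; \hat{A} \;\; \hat{B} \,]) = \begin{bmatrix} \Sigma_{XX} & 0 & 0 \\ 0 & \Sigma_{AA} & \Sigma_{AB} \\ 0 & \Sigma_{BA} & \Sigma_{BB} \end{bmatrix}$$

with $\Sigma_{XX} = Var(\bar{X})$, $\Sigma_{AA} = Var(\hat{A})$,
$\Sigma_{BB} = Var(\hat{B})$ and $\Sigma_{AB} = \Sigma_{BA} = Cov(\hat{A}, \hat{B})$.
Since the factors $\hat{A}$ and $\hat{B}$ are derived from the 1988 bridge
assessments, and are consequently independent of the value $\bar{X}$ from the
1986 assessment, the covariances between $\bar{X}$ and $\hat{A}$ and between
$\bar{X}$ and $\hat{B}$ are zero. Thus

$$Var(\bar{X}_{adj}) \approx \hat{A}^2 \Sigma_{XX} + \bar{X}^2 \Sigma_{AA} + 2\bar{X}\Sigma_{AB} + \Sigma_{BB}. \qquad (4)$$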
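As a check on equation (4), the sketch below evaluates it directly and also as the quadratic form with gradient [A, X, 1] and a variance-covariance matrix whose covariances with the 1986 mean are zero. All numeric inputs are hypothetical values chosen for illustration; the function names are invented here.

```python
def var_adjusted_mean(x_bar, a, s_xx, s_aa, s_ab, s_bb):
    """Equation (4): large-sample variance of the adjusted mean."""
    return a ** 2 * s_xx + x_bar ** 2 * s_aa + 2.0 * x_bar * s_ab + s_bb


def var_quadratic_form(x_bar, a, s_xx, s_aa, s_ab, s_bb):
    """Same quantity written as g * Sigma * g' with g = [A, X, 1]."""
    g = [a, x_bar, 1.0]
    sigma = [
        [s_xx, 0.0, 0.0],   # the 1986 mean is independent of A and B
        [0.0, s_aa, s_ab],
        [0.0, s_ab, s_bb],
    ]
    return sum(g[i] * sigma[i][j] * g[j] for i in range(3) for j in range(3))


# Hypothetical inputs, for illustration only.
x_bar, a = 0.2, 40.0
s_xx, s_aa, s_ab, s_bb = 0.0009, 0.36, -0.05, 1.2

full = var_adjusted_mean(x_bar, a, s_xx, s_aa, s_ab, s_bb)
naive = a ** 2 * s_xx  # equation (3), which ignores the variability of A and B
```

With these inputs the equation (3) value is 1.44 while equation (4) gives about 2.63, which illustrates how ignoring the sampling variability of the equating coefficients understates the variance.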
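An estimate of $\Sigma_{XX}$ comes by applying the jackknife technology
(E. G. Johnson, 1987b) to the estimate $\bar{X}$. Since the factors
$\hat{A}$ and $\hat{B}$ are each functions of the bridge sample means and
standard deviations, estimates of $\Sigma_{AA}$, $\Sigma_{AB}$ and
$\Sigma_{BB}$ can be obtained by expressing $\hat{A}$ and $\hat{B}$ in terms
of the vector

$$\Psi = [\, \hat{\mu}_1, \; \hat{\sigma}_1, \; \hat{\mu}_2, \; \hat{\sigma}_2 \,]$$

and applying the delta method to the result. This produces

$$\begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} \approx \left[ \frac{\partial \hat{A}}{\partial \Psi} \;\; \frac{\partial \hat{B}}{\partial \Psi} \right] \Sigma_{\Psi} \left[ \frac{\partial \hat{A}}{\partial \Psi} \;\; \frac{\partial \hat{B}}{\partial \Psi} \right]' \qquad (5)$$

where $\Sigma_{\Psi}$ is the 4x4 variance-covariance matrix of $\Psi$. (In
$\Sigma_{\Psi}$, the covariances between the terms based on the bridge to
1986 and the terms based on the bridge to 1984 are taken to be 0, since
these two bridges are independent samples.)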

Estimates of the various terms in $\Sigma_{\Psi}$ can be obtained by the
jackknife repeated replications technique. However, it is known (Mosteller &
Tukey, 1977) that the jackknife procedure has relatively poorer performance
in estimating the variance of a statistic with a markedly nonsymmetric
distribution. Consequently, the jackknife estimates of the variances of
$\hat{\sigma}_1$ and $\hat{\sigma}_2$ would be expected to be of lower
quality than the jackknife variance estimates of $\hat{\mu}_1$ and
$\hat{\mu}_2$.

Since the jackknife performs better when the distribution of the
statistic in question is symmetric, it is preferable to apply a symmetrizing
transformation to the standard deviations $\hat{\sigma}_1$ and
$\hat{\sigma}_2$, obtain the jackknife variance estimates of the transformed
statistics, and reexpress (5) to account for the transformation. A
transformation of a variance statistic which promotes symmetry is the
Wilson-Hilferty cube-root transformation (Kendall & Stuart, 1977, p. 400).
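The idea of jackknifing a transformed statistic can be sketched as follows. This is a simplified illustration, not the procedure used for NAEP: it assumes a simple random sample and a delete-one jackknife, whereas the report's jackknife operates on the survey's replicate structure (E. G. Johnson, 1987b). The data values and function names are hypothetical.

```python
import statistics


def jackknife_variance(values, stat):
    """Delete-one jackknife variance estimate of stat(values)."""
    n = len(values)
    # Recompute the statistic with each observation left out in turn.
    replicates = [stat(values[:i] + values[i + 1:]) for i in range(n)]
    center = sum(replicates) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)


def cube_root_variance(values):
    """Wilson-Hilferty transform: cube root of the variance, i.e. sigma**(2/3)."""
    return statistics.pvariance(values) ** (1.0 / 3.0)


# Hypothetical proficiency values, for illustration only.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
var_mean = jackknife_variance(scores, statistics.fmean)
var_omega = jackknife_variance(scores, cube_root_variance)
```

For the sample mean, the delete-one jackknife reproduces the familiar sample variance divided by n (here 2.5 / 5 = 0.5), which is a convenient sanity check before applying the same routine to the transformed standard deviation.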

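Let

$$\Omega = [\, \hat{\mu}_1, \; \hat{\omega}_1, \; \hat{\mu}_2, \; \hat{\omega}_2 \,]$$

where $\hat{\omega}_1 = \hat{\sigma}_1^{2/3}$ and
$\hat{\omega}_2 = \hat{\sigma}_2^{2/3}$ are the Wilson-Hilferty transformed
values of $\hat{\sigma}_1$ and $\hat{\sigma}_2$. Then

$$\hat{A} = (\hat{\omega}_2 / \hat{\omega}_1)^{3/2} \quad \text{and} \quad \hat{B} = \hat{\mu}_2 - \hat{\mu}_1 (\hat{\omega}_2 / \hat{\omega}_1)^{3/2}$$

so that

$$\begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} \approx \left[ \frac{\partial \hat{A}}{\partial \Omega} \;\; \frac{\partial \hat{B}}{\partial \Omega} \right] \Sigma_{\Omega} \left[ \frac{\partial \hat{A}}{\partial \Omega} \;\; \frac{\partial \hat{B}}{\partial \Omega} \right]' \qquad (6)$$

where

$$\Sigma_{\Omega} = \begin{bmatrix} \Sigma_{\mu_1\mu_1} & \Sigma_{\mu_1\omega_1} & 0 & 0 \\ \Sigma_{\omega_1\mu_1} & \Sigma_{\omega_1\omega_1} & 0 & 0 \\ 0 & 0 & \Sigma_{\mu_2\mu_2} & \Sigma_{\mu_2\omega_2} \\ 0 & 0 & \Sigma_{\omega_2\mu_2} & \Sigma_{\omega_2\omega_2} \end{bmatrix}.$$

Since

$$\frac{\partial \hat{A}}{\partial \Omega} = \left[\, 0, \; -\frac{3\hat{A}}{2\hat{\omega}_1}, \; 0, \; \frac{3\hat{A}}{2\hat{\omega}_2} \,\right]$$

and

$$\frac{\partial \hat{B}}{\partial \Omega} = \left[\, -\hat{A}, \; \frac{3\hat{A}\hat{\mu}_1}{2\hat{\omega}_1}, \; 1, \; -\frac{3\hat{A}\hat{\mu}_1}{2\hat{\omega}_2} \,\right],$$

from (6) we have

$$\Sigma_{AA} \approx (9/4)\, \hat{A}^2 \left( \Sigma_{\omega_1\omega_1} / \hat{\omega}_1^2 + \Sigma_{\omega_2\omega_2} / \hat{\omega}_2^2 \right),$$

$$\Sigma_{AB} \approx -(9/4)\, \hat{A}^2 \hat{\mu}_1 \left( \Sigma_{\omega_1\omega_1} / \hat{\omega}_1^2 + \Sigma_{\omega_2\omega_2} / \hat{\omega}_2^2 \right)$$

(taking $\Sigma_{\mu_1\omega_1}$ and $\Sigma_{\mu_2\omega_2}$ to be zero), and

$$\Sigma_{BB} \approx \hat{A}^2 \Sigma_{\mu_1\mu_1} + \Sigma_{\mu_2\mu_2} + (9/4)\, \hat{A}^2 \hat{\mu}_1^2 \left( \Sigma_{\omega_1\omega_1} / \hat{\omega}_1^2 + \Sigma_{\omega_2\omega_2} / \hat{\omega}_2^2 \right).$$

Inserting these approximations into (4) produces the following estimate
of the variance of the 1986 adjusted means:

$$Var(\bar{X}_{adj}) \approx \hat{A}^2 \left( \Sigma_{XX} + \Sigma_{\mu_1\mu_1} \right) + \Sigma_{\mu_2\mu_2} + (9/4)\, \hat{A}^2 (\bar{X} - \hat{\mu}_1)^2 \left( \Sigma_{\omega_1\omega_1} / \hat{\omega}_1^2 + \Sigma_{\omega_2\omega_2} / \hat{\omega}_2^2 \right) \qquad (7)$$

and the standard error of the adjusted mean is the square root of this
variance estimate.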
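The step from equation (4) to equation (7) can be verified numerically. The sketch below builds the variance and covariance of A and B from the delta-method expressions (with the mean-by-transformed-SD covariances taken as zero), substitutes them into equation (4), and compares the result with equation (7). Every numeric input is a hypothetical value chosen for illustration, and the function names are invented here.

```python
import math


def var_eq7(x_bar, a, mu1, w1, w2, s_xx, s_mu1, s_mu2, s_w1, s_w2):
    """Equation (7): variance of the adjusted mean in terms of jackknifed parts."""
    rel = s_w1 / w1 ** 2 + s_w2 / w2 ** 2
    return (a ** 2 * (s_xx + s_mu1) + s_mu2
            + 2.25 * a ** 2 * (x_bar - mu1) ** 2 * rel)


def var_eq4_via_delta(x_bar, a, mu1, w1, w2, s_xx, s_mu1, s_mu2, s_w1, s_w2):
    """Equation (4) with Var(A), Cov(A, B), Var(B) from the delta method."""
    rel = s_w1 / w1 ** 2 + s_w2 / w2 ** 2
    s_aa = 2.25 * a ** 2 * rel
    s_ab = -2.25 * a ** 2 * mu1 * rel
    s_bb = a ** 2 * s_mu1 + s_mu2 + 2.25 * a ** 2 * mu1 ** 2 * rel
    return a ** 2 * s_xx + x_bar ** 2 * s_aa + 2.0 * x_bar * s_ab + s_bb


# Hypothetical bridge statistics and variance components.
mu1, sigma1, sigma2 = 0.10, 0.95, 40.0
w1, w2 = sigma1 ** (2.0 / 3.0), sigma2 ** (2.0 / 3.0)  # Wilson-Hilferty values
a = (w2 / w1) ** 1.5                                    # equals sigma2 / sigma1
args = (0.3, a, mu1, w1, w2, 0.0009, 0.0004, 0.8, 0.0002, 0.01)

v7 = var_eq7(*args)
v4 = var_eq4_via_delta(*args)
se_adjusted = math.sqrt(v7)
```

The two routes agree because the three terms in X squared, X times the bridge mean, and the bridge mean squared collect into the single squared difference between the 1986 mean and the first bridge mean.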

An idea of the effect of using the square root of equation (7) as an
estimate of the standard error of an adjusted mean, rather than the
traditional estimate based on equation (3), can be obtained by comparing the
two standard error estimates. Table E.1 does this for the standard errors of
the 1986 adjusted mean scores for students of age 9, 13 and 17.



Table E.1

Comparison of Standard Errors for the 1986 Adjusted Mean Scores

            SE(7)*    SE(3)†    Ratio

Age 9        1.9       1.2      1.58

Age 13       1.6       1.0      1.60

Age 17       1.7       1.1      1.55

* Standard error computed using equation 7
† Standard error computed using equation 3

We see that the effect of acknowledging that the parameters in the
equating function (1) are subject to variability is to multiply the estimate
of the standard error of the population mean proficiency estimate by about
1.6. Viewed in another way, the traditional estimate of the standard error of
an adjusted mean proficiency value may underestimate the standard error by a
factor of 1.6, so that the variance estimate based on (7) is two and a half
times the size of that based on (3). This is largely because the traditional
variance estimate only considers the variance of $\bar{X}$ while the more
proper variance is essentially the sum of the (appropriately scaled)
variances of the three means: $\bar{X}$, $\hat{\mu}_1$, and $\hat{\mu}_2$.
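The ratios in Table E.1 and the "two and a half times" statement can be reproduced from the rounded standard errors in the table. This is a small arithmetic check; the variable names are illustrative.

```python
# Standard errors transcribed from Table E.1.
se_eq7 = {"Age 9": 1.9, "Age 13": 1.6, "Age 17": 1.7}  # computed using equation (7)
se_eq3 = {"Age 9": 1.2, "Age 13": 1.0, "Age 17": 1.1}  # computed using equation (3)

# Ratio of the two standard errors for each age, rounded as in the table.
ratios = {age: round(se_eq7[age] / se_eq3[age], 2) for age in se_eq7}

# A standard error ratio of about 1.6 corresponds to a variance ratio of 1.6 squared.
variance_ratio = 1.6 ** 2
```

Squaring the standard error ratio of about 1.6 gives a variance ratio of 2.56, which is the basis for describing the equation (7) variance as roughly two and a half times the equation (3) variance.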






REFERENCES 






Applebee, A. N., Langer, J. A., & Mullis, I. V. S. (1988). Who reads best?
Factors related to reading achievement in grades 3, 7, and 11. (No. 17-
R-01) Princeton, NJ: National Assessment of Educational Progress,
Educational Testing Service.

Baratz-Snowden, J. C., Rock, D., Pollack, J., & Wilder, G. (1988). The
educational progress of language minority children: Findings from the
NAEP 1985-86 special study. Princeton, NJ: Educational Testing
Service.

Beaton, A. E. (1987). Implementing the new design: The NAEP 1983-84
technical report. (No. 15-TR-20) Princeton, NJ: Educational Testing
Service.

Beaton, A. E. (1988a). The NAEP 1985-86 reading anomaly: A technical
report. Princeton, NJ: Educational Testing Service.

Beaton, A. E. (1988b). Expanding the new design: The NAEP 1985-86 technical
report. (No. 17-TR-20) Princeton, NJ: Educational Testing Service.

Cochran, W. G., et al. (1954). Statistical problems of the Kinsey report on
the sexual behavior in the human male. Washington, DC: American
Statistical Association.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972).
The dependability of behavioral measurements: Theory of
generalizability for scores and profiles. New York: Wiley.

Dossey, J. A., Mullis, I. V. S., Lindquist, M. M., & Chambers, D. L. (1988).
The mathematics report card: Are we measuring up? Trends and achievement
based on the 1986 National Assessment. (No. 17-M-01) Princeton, NJ:
National Assessment of Educational Progress.

Ferris, J. J. (1988). Quality control of data entry. In A. E. Beaton,
Expanding the new design: The NAEP 1985-86 technical report (pp. 143-
146). (No. 17-TR-20) Princeton, NJ: Educational Testing Service.

Haertel, E. (Chair). (1989). Report of the NAEP technical review panel on
the 1986 reading anomaly, the accuracy of NAEP trends, and issues raised
by state-level NAEP comparisons. National Center for Education
Statistics Technical Report CS 89-499. Washington, DC: U.S.
Department of Education, Office of Educational Research and Development.






Hedges, L. V. (1989). The NAEP/ETS report on the 1986 reading data anomaly:
A technical critique. In E. Haertel (Chair), Report of the NAEP
technical review panel on the 1986 reading anomaly, the accuracy of NAEP
trends, and issues raised by state-level NAEP comparisons (pp. 75-84).
National Center for Education Statistics Technical Report CS 89-499.
Washington, DC: U.S. Department of Education, Office of Educational
Research and Development.

Johnson, E. G. (1987a). Design effects. In A. E. Beaton, Implementing the
new design: The NAEP 1983-84 technical report (pp. 545-562). (No. 15-
TR-20) Princeton, NJ: Educational Testing Service.

Johnson, E. G. (1987b). Estimation of uncertainty due to sampling
variability. In A. E. Beaton, Implementing the new design: The NAEP
1983-84 technical report (pp. 505-512). (No. 15-TR-20) Princeton, NJ:
Educational Testing Service.

Johnson, E. G. (1988a). The NAEP populations and samples. In A. E. Beaton,
The NAEP 1985-86 reading anomaly: A technical report (pp. 21-34).
Princeton, NJ: Educational Testing Service.

Johnson, E. G. (1988b). Attributes of low-scoring students. In A. E.
Beaton, The NAEP 1985-86 reading anomaly: A technical report (pp. 35-
43). Princeton, NJ: Educational Testing Service.

Johnson, E. G. (1988c). Item data analyses. In A. E. Beaton, The NAEP 1985-
86 reading anomaly: A technical report (pp. 65-77). Princeton, NJ:
Educational Testing Service.

Johnson, E. G. (1988d). Mathematics data analysis. In A. E. Beaton,
Expanding the new design: The NAEP 1985-86 technical report (pp. 215-
240). (No. 17-TR-20) Princeton, NJ: Educational Testing Service.

Johnson, E. G., Burke, J., Braden, J., Hansen, M. H., Lago, J. A., &
Tepping, B. J. (1988). Weighting procedures and variance estimation.
In A. E. Beaton, Expanding the new design: The NAEP 1985-86 technical
report (pp. 273-291). (No. 17-TR-20) Princeton, NJ: Educational
Testing Service.

Johnson, J. R. (1987). Instrument and item information. In A. E. Beaton,
Implementing the new design: The NAEP 1983-84 technical report (pp.
119-134). (No. 15-TR-20) Princeton, NJ: Educational Testing Service.

Johnson, J. R. (1988). Booklet format and administration. In A. E. Beaton,
The NAEP 1985-86 reading anomaly: A technical report (pp. 53-57).
Princeton, NJ: Educational Testing Service.

Kendall, M., & Stuart, A. (1977). The advanced theory of statistics. Volume
I, 4th edition. New York: MacMillan.






Leary, L. F., & Dorans, N. J. (1985). Implications for altering the context
in which test items appear: A historical perspective on an immediate
concern. Review of Educational Research, 55, 387-413.

Messick, S. J., Beaton, A. E., & Lord, F. M. (1983). NAEP reconsidered: A
new design for a new era. (NAEP Report 83-1) Princeton, NJ:
Educational Testing Service.

Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49,
359-381.

Mislevy, R. J. (1988). Scaling procedures. In A. E. Beaton, Expanding the
new design: The NAEP 1985-86 technical report (pp. 177-204). (No. 17-
TR-20) Princeton, NJ: Educational Testing Service.

Mislevy, R. J., & Bock, R. D. (1982). BILOG: Item analysis and test scoring
with binary logistic models [Computer program]. Mooresville, IN:
Scientific Software.

Mislevy, R. J., & Sheehan, K. M. (1987). Marginal estimation procedures. In
A. E. Beaton, Implementing the new design: The NAEP 1983-84 technical
report (pp. 293-360). (No. 15-TR-20) Princeton, NJ: Educational
Testing Service.

Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression.
Reading, MA: Addison-Wesley.

Mullis, I. V. S., & Jenkins, L. B. (1988). The science report card: Elements
of risk and recovery. Trends and achievement based on the 1986 National
Assessment. (No. 17-S-01) Princeton, NJ: National Assessment of
Educational Progress.

Mullis, I. V. S., & Jenkins, L. B. (1990). The reading report card, 1971 to
1988: Trends from the Nation's Report Card. Princeton, NJ:
Educational Testing Service.

The reading report card: Progress toward excellence in our schools. (1985).
(NAEP Report 15-R-01). Princeton, NJ: Educational Testing Service.

Schrader, W. B. (1988). [Review of NAEP 1986 database] In A. E. Beaton, The
NAEP 1985-86 reading anomaly: A technical report (pp. 97-98).
Princeton, NJ: Educational Testing Service.

Sheehan, K. M. (1985). M-Group: Estimation of group effects in multivariate
models [Computer program]. Princeton, NJ: Educational Testing Service.

Sheehan, K. M. (1988). The IRT linking procedure used to place the 1986
intermediary scaling results onto the 1984 reading calibration scale.
In A. E. Beaton, Expanding the new design: The NAEP 1985-86 technical
report (pp. 555-565). (No. 17-TR-20) Princeton, NJ: Educational
Testing Service.






Sheehan, K. M., & Mislevy, R. J. (1988). Some consequences of the
uncertainty in IRT linking procedures. (ETS Research Report RR-88-38-
ONR) Princeton, NJ: Educational Testing Service.

Slobasky, R. (1988). [Field administration factors and the 1986 NAEP reading
scores] In A. E. Beaton, The NAEP 1985-86 reading anomaly: A technical
report (pp. 99-101). Princeton, NJ: Educational Testing Service.

Wise, L. L., Chia, W. J., & Park, R. K. (1989). Item position effects for
test of word knowledge and arithmetic reasoning. Paper presented at the
annual meeting of the American Educational Research Association, March
1989, San Francisco.

Yamamoto, K. (1988). Science data analysis. In A. E. Beaton, Expanding the
new design: The NAEP 1985-86 technical report (pp. 243-255). (No. 17-
TR-20) Princeton, NJ: Educational Testing Service.

Zwick, R. (1988a). Block and booklet analyses. In A. E. Beaton, The NAEP
1985-86 reading anomaly: A technical report (pp. 79-86). Princeton,
NJ: Educational Testing Service.

Zwick, R. (1988b). Reading data analysis. In A. E. Beaton, Expanding the
new design: The NAEP 1985-86 technical report (pp. 207-212). (No. 17-
TR-20) Princeton, NJ: Educational Testing Service.






